
17 May 2016

Finding new intron-exon junctions using the public ENCODE RNA-Seq data

I've been asked to look for some new/suspected/previously uncharacterized intron-exon junctions in public RNA-Seq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.

The following command is used to build the list of BAMs:

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' 

wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \
(...)

This list is inserted as a variable named SAMPLES in a Makefile.

For each BAM, we use samtools to retrieve the reads in the region(s) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to keep only the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read and computes the genomic interval that is skipped when the CIGAR operator is a deletion or a skipped region/intron. The read is printed if that interval describes the new intron-exon junction.

All in one:
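A simplified sketch of such a Makefile (illustrative only: the region, the output names and the samjs invocation are assumptions, and junction.js stands for the javascript filter described above):

BASE=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq
# hypothetical region of interest
REGION=chr1:1234500-1234700
SAMPLES=wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
	wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2

all : $(addsuffix .junctions.sam,${SAMPLES})

# fetch the reads overlapping the region from the remote BAM,
# then keep only the reads supporting the new junction
%.junctions.sam :
	samtools view -h "${BASE}/$*.bam" "${REGION}" |\
	java -jar samjs.jar -f junction.js > $@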




That's it,

Pierre



18 June 2015

Playing with the #GA4GH schemas and #Avro : my notebook

After watching David Haussler's talk "Beacon Project and Data Sharing APIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (GA4GH) coalition. Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the Java tools and libraries for Apache Avro:
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the GA4GH from GitHub:
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip

$ find -name "*.avdl"
./schemas-0.5.1/src/main/resources/avro/readmethods.avdl
./schemas-0.5.1/src/main/resources/avro/common.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadata.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadatamethods.avdl
./schemas-0.5.1/src/main/resources/avro/wip/variationReference.avdl
./schemas-0.5.1/src/main/resources/avro/variants.avdl
./schemas-0.5.1/src/main/resources/avro/variantmethods.avdl
./schemas-0.5.1/src/main/resources/avro/beacon.avdl
./schemas-0.5.1/src/main/resources/avro/references.avdl
./schemas-0.5.1/src/main/resources/avro/referencemethods.avdl
./schemas-0.5.1/src/main/resources/avro/reads.avdl
Those schemas can be compiled to Java using avro-tools:
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
  schemas-0.5.1/src/main/resources/avro/variants.avpr
  
$ find generated/org/ -name "*.java"
generated/org/ga4gh/GAPosition.java
generated/org/ga4gh/GAVariantSetMetadata.java
generated/org/ga4gh/GACall.java
generated/org/ga4gh/GAException.java
generated/org/ga4gh/GACigarOperation.java
generated/org/ga4gh/GAVariantSet.java
generated/org/ga4gh/GAVariants.java
generated/org/ga4gh/GAVariant.java
generated/org/ga4gh/GACallSet.java
generated/org/ga4gh/GACigarUnit.java
As a test, the following Java source uses the classes generated by avro-tools to create nine variants and serialize them to an Avro file:

Compile, archive and execute:
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use avro-tools to convert the generated file variant.avro to JSON:
java -jar avro-tools-1.7.7.jar tojson variant.avro

Output:

The complete Makefile ties all these steps together.
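In outline (a sketch reconstructed from the commands above; the idl step that converts variants.avdl into variants.avpr is an assumption, and src/test/TestAvro.java is the test program described above):

AVRO=avro-tools-1.7.7.jar
AVDL=schemas-0.5.1/src/main/resources/avro/variants.avdl

variant.json : variant.avro ${AVRO}
	java -jar ${AVRO} tojson variant.avro > $@

variant.avro : generated/ga4gh.jar ${AVRO}
	java -cp ${AVRO}:generated/ga4gh.jar test.TestAvro > $@

generated/ga4gh.jar : variants.avpr src/test/TestAvro.java ${AVRO}
	java -jar ${AVRO} compile protocol variants.avpr generated && \
	javac -d generated -cp ${AVRO} -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java && \
	jar cvf $@ -C generated org -C generated test

variants.avpr : ${AVDL} ${AVRO}
	java -jar ${AVRO} idl ${AVDL} $@

${AVDL} :
	curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip" && \
	unzip schema.zip && \
	rm schema.zip

${AVRO} :
	curl -L -o ${AVRO} "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"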



That's it,
Pierre

05 December 2014

Divide-and-conquer in a #Makefile: recursion and #parallelism.

This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of N VCF files. You want to create a list of:

  • common SNPs in vcf1 and vcf2
  • common SNPs in vcf3 and the previous list
  • common SNPs in vcf4 and the previous list
  • (...)
  • common SNPs in vcfN and the previous list
Yes, I know I can do this using: grep -v '^#' f.vcf | cut -f 1,2,4,5 | sort | uniq

Using a linear Makefile, it could look like this:

# 'uniq -d' keeps only the duplicated lines, i.e. the SNPs common to both inputs
list2: vcf1 vcf2
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@
list3: vcf3 list2
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@
list4: vcf4 list3
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@
list5: vcf5 list4
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@
(...)

We can speed up the workflow with make's parallel option -j (number of parallel jobs) and a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently, in parallel.

list1_2: vcf1 vcf2
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@

list3_4: vcf3 vcf4
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@

list1_4: list1_2 list3_4
    grep -v '^#' $^ | cut -f 1,2,4,5 | sort | uniq -d > $@

By using the built-in 'Make' functions $(eval), $(call) and $(shell), we can define a recursive macro "recursive". This macro takes two arguments: the 0-based indexes delimiting a slice of the list of VCFs. Here is the Makefile:
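A simplified, self-contained sketch of that Makefile (the VCF names and the expr-based arithmetic are illustrative; $(if) is used rather than ifeq because $(if) only expands the branch that is selected, which is what terminates the recursion):

VCFS := $(sort $(wildcard Sample*.vcf.gz))

# midpoint of two 0-based indexes
mid = $(shell expr \( $(1) + $(2) \) / 2)

# leaf: extract CHROM/POS/REF/ALT from one VCF
define leaf
target$(1)_$(2) : $(word $(2),$(VCFS))
	gunzip -c $$< | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > $$@
endef

# inner node: intersection of the two halves
define node
target$(1)_$(2) : target$(1)_$(3) target$(3)_$(2)
	LC_ALL=C comm -12 $$^ > $$@
$(call recursive,$(1),$(3))
$(call recursive,$(3),$(2))
endef

# recursive(lo,hi) emits the rules building target<lo>_<hi>
recursive = $(if $(filter 1,$(shell expr $(2) - $(1))),$(call leaf,$(1),$(2)),$(call node,$(1),$(2),$(call mid,$(1),$(2))))

all : target0_$(words $(VCFS))

$(eval $(call recursive,0,$(words $(VCFS))))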
Running the makefile:
$ make 
gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1
gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2
LC_ALL=C comm -12 target0_1 target1_2 > target0_2
gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3
gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4
gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5
LC_ALL=C comm -12 target3_4 target4_5 > target3_5
LC_ALL=C comm -12 target2_3 target3_5 > target2_5
LC_ALL=C comm -12 target0_2 target2_5 > target0_5
gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6
gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7
LC_ALL=C comm -12 target5_6 target6_7 > target5_7
gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8
gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9
gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10
LC_ALL=C comm -12 target8_9 target9_10 > target8_10
LC_ALL=C comm -12 target7_8 target8_10 > target7_10
LC_ALL=C comm -12 target5_7 target7_10 > target5_10
LC_ALL=C comm -12 target0_5 target5_10 > target0_10
gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11
gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12
LC_ALL=C comm -12 target10_11 target11_12 > target10_12
gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13
gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14
gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15
LC_ALL=C comm -12 target13_14 target14_15 > target13_15
LC_ALL=C comm -12 target12_13 target13_15 > target12_15
LC_ALL=C comm -12 target10_12 target12_15 > target10_15
gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16
gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17
LC_ALL=C comm -12 target15_16 target16_17 > target15_17
gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18
gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19
gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20
LC_ALL=C comm -12 target18_19 target19_20 > target18_20
LC_ALL=C comm -12 target17_18 target18_20 > target17_20
LC_ALL=C comm -12 target15_17 target17_20 > target15_20
LC_ALL=C comm -12 target10_15 target15_20 > target10_20
LC_ALL=C comm -12 target0_10 target10_20 > target0_20
And here is the generated workflow, drawn with make2graph.
That's it
Pierre.

04 December 2014

XML+XSLT = #Makefile-based #workflows for #bioinformatics

I've recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I've suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of the data, and that XSLT could transform this model into a Makefile-based workflow. I've already explored this idea in a previous post (Generating a pipeline of analysis (Makefile) for NGS with xslt) and, in our lab, we use JSON and jsvelocity to generate our workflows (e.g. https://gist.github.com/lindenb/3c07ca722f793cc5dd60).
In the current post, I'll describe my GitHub repository containing a complete XML+XSLT example: https://github.com/lindenb/ngsxml.

Download the data

git clone "https://github.com/lindenb/ngsxml.git"


The model
The XML model is self-explanatory:
<?xml version="1.0" encoding="UTF-8"?>
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>


Validating the model:
This XML document can be validated with an XML schema:
$ xmllint  --schema xsd/schema01.xsd --noout test/model01.xml
test/model01.xml validates


Generating the Makefile:
The XML document is transformed into a Makefile using the following XSLT stylesheet: https://github.com/lindenb/ngsxml/blob/master/stylesheets/model2make.xsl
xsltproc --output Makefile stylesheets/model2make.xsl test/model01.xml
The Makefile:
# Name
#  myProject
# Description:
#  my project
# 
include config.mk
OUTDIR=OUT
BINDIR=$(abspath ${OUTDIR})/bin


# if tools are undefined
bwa.exe ?=${BINDIR}/bwa-0.7.10/bwa
samtools.exe ?=${BINDIR}/samtools-0.1.19/samtools
bcftools.exe ?=${BINDIR}/samtools-0.1.19/bcftools/bcftools
tabix.exe ?=${BINDIR}/tabix-0.2.6/tabix
bgzip.exe ?=${BINDIR}/tabix-0.2.6/bgzip

.PHONY=all clean all_bams all_vcfs

all: all_vcfs


all_vcfs:  \
 $(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz


all_bams:  \
 $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam \
 $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam

#
# VCF for project 'Proj1'
# 
$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz : $(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam) \
 $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bgzip.exe} ${tabix.exe} ${bcftools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} mpileup -uf ${REFERENCE} $(basename $(filter %.bai,$^)) | \
 ${bcftools.exe} view -vcg - > $(basename $@)  && \
 ${bgzip.exe} -f $(basename $@) && \
 ${tabix.exe} -f -p vcf $@
 
    



#
# index final BAM for Sample 'Sample1'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample1'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@



#
# merge BAMs 
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam :  \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
  ${samtools.exe} merge -f $@ $(filter %.bam,$^)
  



 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_01_R1.fastq.gz and test/fastq/sample_1_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_1_01_R1.fastq.gz  \
 test/fastq/sample_1_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20678948\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_02_R1.fastq.gz and test/fastq/sample_1_02_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam : \
 test/fastq/sample_1_02_R1.fastq.gz  \
 test/fastq/sample_1_02_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  ${REFERENCE} \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




#
# index final BAM for Sample 'Sample2'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample2'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_2_01_R1.fastq.gz and test/fastq/sample_2_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_2_01_R1.fastq.gz  \
 test/fastq/sample_2_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20681172\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




$(addsuffix .fai,${REFERENCE}): ${REFERENCE} ${samtools.exe}
 ${samtools.exe} faidx $<

$(addsuffix .bwt,${REFERENCE}): ${REFERENCE} ${bwa.exe}
 ${bwa.exe} index $<


${BINDIR}/bwa-0.7.10/bwa :
 rm -rf $(BINDIR)/bwa-0.7.10/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/bwa-0.7.10.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/bwa-0.7.10.tar.bz2 && \
 make -C $(dir $@)

${BINDIR}/samtools-0.1.19/bcftools/bcftools: ${BINDIR}/samtools-0.1.19/samtools

${BINDIR}/samtools-0.1.19/samtools  : 
 rm -rf $(BINDIR)/samtools-0.1.19/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/samtools-0.1.19.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/samtools-0.1.19.tar.bz2 && \
 make -C $(dir $@)


${BINDIR}/tabix-0.2.6/bgzip : ${BINDIR}/tabix-0.2.6/tabix


${BINDIR}/tabix-0.2.6/tabix  : 
 rm -rf $(BINDIR)/tabix-0.2.6/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/tabix-0.2.6.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/tabix-0.2.6.tar.bz2 && \
 make -C $(dir $@) tabix bgzip

clean:
 rm -rf ${BINDIR}

Drawing the Workflow:
The workflow is drawn with https://github.com/lindenb/makefile2graph.


Running Make:
And here is the output of make:
rm -rf /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/
bwa-0.7.10/
bwa-0.7.10/bamlite.c
(...)
gcc -g -Wall -Wno-unused-function -O2 -msse -msse2 -msse3 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o is.o bwtindex.o bwape.o kopen.o pemerge.o bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o bwtsw2_chain.o fastmap.o bwtsw2_pair.o main.o -o bwa -L. -lbwa -lm -lz -lpthread
make[1]: Entering directory `/home/lindenb/src/ngsxml'
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa index test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/
samtools-0.1.19/
samtools-0.1.19/.gitignore
samtools-0.1.19/AUTHORS
(...)
gcc -g -Wall -O2 -o bamcheck bamcheck.o -L.. -lm -lbam -lpthread -lz
make[3]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/misc'
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19'
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11671724\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  test/ref/ref.fa \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
  /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools merge -f OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11673828\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted.bam  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools faidx test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ tabix bgzip
tabix-0.2.6/
tabix-0.2.6/ChangeLog
tabix-0.2.6/Makefile
(...)
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6'
mkdir -p OUT/Projects/Proj1/VCF/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools mpileup -uf test/ref/ref.fa OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam | \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/bcftools/bcftools view -vcg - > OUT/Projects/Proj1/VCF/Proj1.vcf  && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/bgzip -f OUT/Projects/Proj1/VCF/Proj1.vcf && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/tabix -f -p vcf OUT/Projects/Proj1/VCF/Proj1.vcf.gz
make[1]: Leaving directory `/home/lindenb/src/ngsxml'
At the end, a VCF is generated:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr4_gl000194_random 1973 . C G 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,3,0;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2462 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2492 . G C 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2504 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
That's it,

Pierre

03 September 2014

Parallelizing GNU #Make 4 in a #SLURM infrastructure/cluster

SLURM (https://computing.llnl.gov/linux/slurm/slurm.html) is the Simple Linux Utility for Resource Management. It's an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

A patch for GNU Make version 3.81 is available as part of the SLURM distribution at https://github.com/SchedMD/slurm/blob/master/contribs/make.slurm.patch. This patch makes make use SLURM to launch tasks across a job's current resource allocation.

The patch 'just' wraps each command into srun (srun is the command that submits a parallel job to a cluster managed by SLURM):

Index: job.c
===================================================================
--- job.c (revision 321)
+++ job.c (working copy)
@@ -1959,6 +1959,22 @@
void
child_execute_job (int stdin_fd, int stdout_fd, char **argv, char **envp)
{
+/* PARALLEL JOB LAUNCH VIA SLURM */
+ if (getenv("SLURM_JOB_ID")) {
+ int i;
+ static char *argx[128];
+ argx[0] = "srun";
+ argx[1] = "-N1";
+ argx[2] = "-n1";
+ for (i=0; ((i<124)&&(argv[i])); i++) {
+ argx[i+3] = argv[i];
+ }
+ if (i<124) {
+ argx[i+3] = NULL;
+ argv = argx;
+ }
+ }
+/* END OF SLURM PATCH */
if (stdin_fd != 0)
(void) dup2 (stdin_fd, 0);
if (stdout_fd != 1)

GNU-Make version 4 was recently released. This new version comes with a number of improvements, like GNU Guile integration and loadable objects (see http://plindenbaum.blogspot.fr/2014/08/a-gnu-make-plug-in-for-illumina-fastqs.html). It also allows specifying the default shell to be invoked (see http://plindenbaum.blogspot.fr/2014/01/parallelizing-rstats-using-make.html).

From http://www.gnu.org/software/make/manual/make.html: "The program used as the shell is taken from the variable SHELL. If this variable is not set in your makefile, the program /bin/sh is used as the shell. The argument(s) passed to the shell are taken from the variable .SHELLFLAGS. The default value of .SHELLFLAGS is -c normally, or -ec in POSIX-conforming mode."

So, if you want to parallelize GNU-Make with SLURM, you can wrap the shell into srun using SHELL and .SHELLFLAGS. Here is an example, creating and concatenating 100 files, each containing its number and the hostname of the node that created it:

ifdef SLURM_JOB_ID
SHELL=srun
.SHELLFLAGS= -N1 -n1  bash -c 
endif
NUMBERS=$(shell seq 1 100 )
TARGETS=  $(addsuffix .test,${NUMBERS} )


.PHONY:  all clean

define TEST

$(addsuffix .test,$(1)) : 
        echo -n  $(1) " " > $$@ && hostname >> $$@
        @sleep 5

endef


all: ${TARGETS}
        cat $^

$(foreach N,$(NUMBERS), $(eval $(call TEST,$(N) ) ) )

clean:
        rm -f ${TARGETS}

Now invoke Make with SLURM and the option -j N (allow N jobs at once):

$ make -j 10
echo -n  1  " " > 1.test && hostname >> 1.test
echo -n  2  " " > 2.test && hostname >> 2.test
echo -n  3  " " > 3.test && hostname >> 3.test
echo -n  4  " " > 4.test && hostname >> 4.test
echo -n  5  " " > 5.test && hostname >> 5.test
echo -n  6  " " > 6.test && hostname >> 6.test
echo -n  7  " " > 7.test && hostname >> 7.test
(...)
echo -n  100  " " > 100.test && hostname >> 100.test
cat 1.test 2.test 3.test 4.test 5.test 6.test 7.test 8.test (...)
1  node004
2  node003
3  node001
4  node002
5  node002
6  node001
7  node002
8  node001
9  node001
10  node002
(...)
92  node004
93  node001
94  node001
95  node001
96  node001
97  node002
98  node001
99  node001
100  node001
That's it,
Pierre

08 August 2014

A GNU-make plug-in for the #Illumina FASTQs.

The latest version of GNU-Make http://www.gnu.org/software/make/ provides many advanced capabilities, including many useful functions. However, it does not contain a complete programming language and so it has limitations. Sometimes these limitations can be overcome through use of the shell function to invoke a separate program, although this can be inefficient. On systems which support dynamically loadable objects, you can write your own extension in any language (which can be compiled into such an object) and load it to provide extended capabilities (see http://www.gnu.org/software/make/manual/make.html#Loading-Objects).

Building a plug-in for the Illumina FASTQs.

From http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm:

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz

Here I'm writing a set of new functions for make to extract the different parts (sample, lane, ...) of a FASTQ file name:

The code is available on github.com at

First, a struct holding the parts of the file name is created:

enum E_IlluminaComponent
    {
    E_sampleName,
    E_barcodeSequence,
    E_lane,
    E_readNumber,
    E_setNumber,
    NUM_ILLUMINA_COMPONENTS
    };

typedef struct illumina_scheme_t
    {
    char* filename;
    char* components[NUM_ILLUMINA_COMPONENTS];
    } IlluminaScheme,*IlluminaSchemePtr ;

and a function parsing the filenames is created:

IlluminaSchemePtr IlluminaSchemeNew(const char* filename)
    {
    ...
    }

When the plugin illumina is loaded as a dynamic C library, the function illumina_gmk_setup is called,
and we tell make about the new functions with gmk_add_function(name, callback, min_args, max_args, no_expand_content):

int illumina_gmk_setup ()
  {
   gmk_add_function ("illumina_sample",illumina_sample, 1, 1, 0);
   gmk_add_function ("illumina_lane",illumina_lane, 1, 1, 0);
   (...)
  }

A function registered with make must match the gmk_func_ptr type.
It will be invoked with three parameters: name (the name of the function), argc (the number of arguments to the function), and argv (an array of pointers to arguments to the function). The last pointer (that is, argv[argc]) will be null (0).
The return value of the function is the result of expanding the function.

char* illumina_sample(const char *function_name, unsigned int argc, char **argv)
    {
    /** extract the filename(s), build and return a string containing the samples */
    }

Compiling

The plugin must be compiled as a dynamic C library.

Note: the manual says the shared library can also be built by the final 'Makefile' itself (via load ./illumina.so), but I was not able to get make to build the missing library (illumina.so: cannot open shared object file: No such file or directory),

so I compiled it by hand:

gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o illumina.so illumina.c

Test

Here is the makefile:

SAMPLES=  NA10831_ATCACG_L002_R1_001.fastq.gz \
      hello \
      NA10832_ATCACG_L002_R1_001.fastq.gz \
      NA10831_ATCACG_L002_R2_001.fastq.gz \
      NA10832_ATCACG_L002_R2_001.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      ERROR_ATCAGG_x003_R3_0z3.fastq.gz \
      false

all:
    @echo "SAMPLES: " $(illumina_sample  ${SAMPLES} )
    @echo "BARCODES: " $(illumina_barcode  ${SAMPLES} )
    @echo "LANE: " $(illumina_lane  ${SAMPLES} )
    @echo "READ: " $(illumina_read  ${SAMPLES} )
    @echo "SET: " $(illumina_set  ${SAMPLES} )

output:

$ make
SAMPLES:  NA10831 NA10832 NA10833
BARCODES:  ATCACG ATCAGG
LANE:  L002 L003
READ:  R1 R2
SET:  001 003

That's it,

Pierre

29 July 2014

Including the hash for the current git-commit in a C program

Say you wrote the following simple C program:
It includes a file "githash.h" that contains the hash of the current Git commit:



Because you're working with a Makefile, the file "githash.h" is generated by invoking 'git rev-parse HEAD':
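A minimal sketch (the macro name GIT_HASH and the program name prog.c are illustrative choices):

# githash.h will contain a single line: #define GIT_HASH "(the hash)"
githash.h :
	echo "#define GIT_HASH \"$$(git rev-parse HEAD)\"" > $@

prog : prog.c githash.h
	gcc -o $@ prog.c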



the file "githash.h" loooks like this:




But wait, it's not that simple: once the file 'githash.h' has been created, it won't be updated by Make because it already exists. This file should be removed each time 'git commit' is invoked. We can do this by creating a POST-COMMIT git hook: a bash script named ".git/hooks/post-commit" that removes 'githash.h':
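A minimal sketch (assuming githash.h sits at the root of the working tree, which is where git runs the hooks):

#!/bin/bash
rm -f githash.h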



don't forget to make it executable: `chmod +x .git/hooks/post-commit`

Now, each time 'git commit' is called, the githash.h generated for the previous commit will be deleted, and Make will regenerate it on the next build!


That's it,

Pierre

05 July 2014

Pushed: makefile2graph, creating a graph of dependencies from GNU-Make.

I pushed makefile2graph at https://github.com/lindenb/makefile2graph. This is the standalone C implementation of a program I first wrote in Java in 2012. The program creates a graph of dependencies from GNU-Make; its output is a graphviz-dot file or a Gexf/Gephi-XML file.

Usage

$ make -Bnd | make2graph > output.dot
$ make -Bnd | make2graph -x > gephi.gexf.xml 

Example

Here is the output of makefile2graph for Tabix:
$ cd tabix-0.2.5
$ make -Bnd |make2graph
digraph G {
n1[label="", color="green"];
n2[label="Makefile", color="green"];
n4[label="all", color="red"];
n3[label="all-recur", color="red"];
n23[label="bedidx.c", color="green"];
n22[label="bedidx.o", color="red"];
n9[label="bgzf.c", color="green"];
n10[label="bgzf.h", color="green"];
n8[label="bgzf.o", color="red"];
n27[label="bgzip", color="red"];
n29[label="bgzip.c", color="green"];
n28[label="bgzip.o", color="red"];
n18[label="index.c", color="green"];
n17[label="index.o", color="red"];
n20[label="khash.h", color="green"];
n16[label="knetfile.c", color="green"];
n11[label="knetfile.h", color="green"];
n15[label="knetfile.o", color="red"];
n24[label="kseq.h", color="green"];
n21[label="ksort.h", color="green"];
n13[label="kstring.c", color="green"];
n14[label="kstring.h", color="green"];
n12[label="kstring.o", color="red"];
n6[label="lib", color="red"];
n7[label="libtabix.a", color="red"];
n26[label="main.c", color="green"];
n25[label="main.o", color="red"];
n5[label="tabix", color="red"];
n19[label="tabix.h", color="green"];
n2 -> n1 ; 
n4 -> n1 ; 
n3 -> n1 ; 
(..)
}

That's it
Pierre

15 May 2014

How I start a bioinformatics project

Phil Ashton tweeted a link to a paper about how to set up a bioinformatics project file hierarchy: "A Quick Guide to Organizing Computational Biology Projects".

Nick Loman posted his version yesterday: "How I start a bioinformatics project" on http://nickloman.github.io/2014/05/14/how-i-start-a-bioinformatics-project/.

Here is mine (simplified):

  • I start by creating a directory managed by git
  • I create a JSON-based description of my data, including the paths to the software and to the reference files
  • I create a git submodule for a project hosting an Apache Velocity template that transforms config.json into a Makefile
  • The Makefile is generated using jsvelocity: it produces the final Makefile from the template and config.json
  • The Makefile is invoked with option -j N (allow N jobs at once) using GNU-Make or QMake (distributed parallel make, scheduled by Sun Grid Engine)

That's it,

Pierre