XML+XSLT = #Makefile -based #workflows for #bioinformatics
I recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of the data, and that XSLT could transform this model into a Makefile-based workflow. I already explored this idea in a previous post (Generating a pipeline of analysis (Makefile) for NGS with xslt.), and in our lab we use JSON and jsvelocity to generate our workflows (e.g. https://gist.github.com/lindenb/3c07ca722f793cc5dd60).
In the current post, I'll describe my GitHub repository containing a complete XML+XSLT example: https://github.com/lindenb/ngsxml.
Download the data
git clone "https://github.com/lindenb/ngsxml.git"
The model
The XML model is self-explanatory:
<?xml version="1.0" encoding="UTF-8"?>
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>
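To make the hierarchy concrete (model, then project, then sample, then one or more paired FASTQ files), here is a small Python sketch that walks the same structure. It is not part of the ngsxml repository: the function name list_fastq_pairs is hypothetical, and the embedded string is simply a copy of the model above without the XML declaration.

```python
import xml.etree.ElementTree as ET

# Copy of the model shown above (XML declaration omitted for brevity).
MODEL = """
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>
"""


def list_fastq_pairs(xml_text):
    """Return (project, sample, forward, reverse) tuples found in the model."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for project in root.findall("project"):
        for sample in project.findall("sample"):
            for fastq in sample.findall("fastq"):
                pairs.append((
                    project.get("name"),
                    sample.get("name"),
                    fastq.findtext("for"),   # forward reads
                    fastq.findtext("rev"),   # reverse reads
                ))
    return pairs


if __name__ == "__main__":
    # One tab-separated line per FASTQ pair.
    for row in list_fastq_pairs(MODEL):
        print("\t".join(row))
```

Sample1 has two FASTQ pairs and Sample2 has one, which is why only Sample1 needs a merge step in the generated workflow.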
Validating the model:
This XML document can optionally be validated against an XML schema:
$ xmllint --schema xsd/schema01.xsd --noout test/model01.xml
test/model01.xml validates
Generating the Makefile:
The XML document is transformed into a Makefile using the following XSLT stylesheet: https://github.com/lindenb/ngsxml/blob/master/stylesheets/model2make.xsl
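The stylesheet itself lives in the repository. As a rough illustration of the kind of mapping it performs, here is a Python analogue (not the author's XSLT) that derives the per-FASTQ sorted BAM target and the bwa read-group string from the model. The defaults (library falls back to the sample name, platform to ILLUMINA, lane to 1) are inferred from the generated Makefile shown further down in this post; the fallback ID "rg<n>" is a placeholder of mine, since the real output shows an auto-generated identifier. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A one-sample excerpt of the model used in this post.
MODEL = """
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>
"""


def read_group(sample_name, fastq, index):
    """Build the '@RG' string passed to 'bwa mem -R' (defaults are assumptions)."""
    fields = [
        ("ID", fastq.get("id", "rg%d" % index)),        # placeholder default ID
        ("SM", sample_name),
        ("LB", fastq.get("library", sample_name)),      # default: sample name
        ("PL", fastq.get("platform", "ILLUMINA")),      # default: ILLUMINA
        ("PU", fastq.get("lane", "1")),                 # default: lane 1
    ]
    if fastq.get("median-size") is not None:
        fields.append(("PI", fastq.get("median-size")))
    # A literal backslash followed by 't' separates the fields, as in the Makefile.
    return "@RG" + "".join("\\t%s:%s" % kv for kv in fields)


def sorted_bam_rules(xml_text):
    """Return (target, read_group) pairs, one per <fastq> element."""
    root = ET.fromstring(xml_text.strip())
    rules = []
    for project in root.findall("project"):
        for sample in project.findall("sample"):
            for i, fastq in enumerate(sample.findall("fastq"), start=1):
                target = "$(OUTDIR)/Projects/%s/Samples/%s/BAM/${tmp.prefix}%d_sorted.bam" % (
                    project.get("name"), sample.get("name"), i)
                rules.append((target, read_group(sample.get("name"), fastq, i)))
    return rules


if __name__ == "__main__":
    for target, rg in sorted_bam_rules(MODEL):
        print(target)
        print("    bwa mem -R '%s' ..." % rg)
```

The point of the sketch is only that every path and option in the Makefile is a pure function of the model, which is what makes a declarative transformation such as XSLT a reasonable fit.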
xsltproc --output Makefile stylesheets/model2make.xsl test/model01.xml
The Makefile:
# Name
# 	myProject
# Description:
# 	my project
#
include config.mk
OUTDIR=OUT
BINDIR=$(abspath ${OUTDIR})/bin

# if tools are undefined
bwa.exe ?=${BINDIR}/bwa-0.7.10/bwa
samtools.exe ?=${BINDIR}/samtools-0.1.19/samtools
bcftools.exe ?=${BINDIR}/samtools-0.1.19/bcftools/bcftools
tabix.exe ?=${BINDIR}/tabix-0.2.6/tabix
bgzip.exe ?=${BINDIR}/tabix-0.2.6/bgzip

.PHONY=all clean all_bams all_vcfs

all: all_vcfs

all_vcfs: \
	$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz

all_bams: \
	$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam \
	$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam

#
# VCF for project 'Proj1'
#
$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz : $(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam) \
	$(addsuffix .fai,${REFERENCE}) ${samtools.exe} ${bgzip.exe} ${tabix.exe} ${bcftools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} mpileup -uf ${REFERENCE} $(basename $(filter %.bai,$^)) | \
	${bcftools.exe} view -vcg - > $(basename $@) && \
	${bgzip.exe} -f $(basename $@) && \
	${tabix.exe} -f -p vcf $@

#
# index final BAM for Sample 'Sample1'
#
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam ${samtools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} index $<

#
# prepare final BAM for Sample 'Sample1'
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam
	mkdir -p $(dir $@) && \
	cp $< $@

$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam ${samtools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} rmdup $< $@

#
# merge BAMs
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam : \
	$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam \
	$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} merge -f $@ $(filter %.bam,$^)

#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
	${samtools} index $<

#
# Align test/fastq/sample_1_01_R1.fastq.gz and test/fastq/sample_1_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam : \
	test/fastq/sample_1_01_R1.fastq.gz \
	test/fastq/sample_1_01_R2.fastq.gz \
	$(addsuffix .bwt,${REFERENCE}) \
	${bwa.exe} ${samtools.exe}
	mkdir -p $(dir $@) && \
	${bwa.exe} mem -R '@RG\tID:idp20678948\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
	${REFERENCE} \
	test/fastq/sample_1_01_R1.fastq.gz \
	test/fastq/sample_1_01_R2.fastq.gz |\
	${samtools.exe} view -uS - |\
	${samtools.exe} sort - $(basename $@)

#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
	${samtools} index $<

#
# Align test/fastq/sample_1_02_R1.fastq.gz and test/fastq/sample_1_02_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam : \
	test/fastq/sample_1_02_R1.fastq.gz \
	test/fastq/sample_1_02_R2.fastq.gz \
	$(addsuffix .bwt,${REFERENCE}) \
	${bwa.exe} ${samtools.exe}
	mkdir -p $(dir $@) && \
	${bwa.exe} mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
	${REFERENCE} \
	test/fastq/sample_1_02_R1.fastq.gz \
	test/fastq/sample_1_02_R2.fastq.gz |\
	${samtools.exe} view -uS - |\
	${samtools.exe} sort - $(basename $@)

#
# index final BAM for Sample 'Sample2'
#
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam ${samtools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} index $<

#
# prepare final BAM for Sample 'Sample2'
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam
	mkdir -p $(dir $@) && \
	cp $< $@

$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
	mkdir -p $(dir $@) && \
	${samtools.exe} rmdup $< $@

#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
	${samtools} index $<

#
# Align test/fastq/sample_2_01_R1.fastq.gz and test/fastq/sample_2_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam : \
	test/fastq/sample_2_01_R1.fastq.gz \
	test/fastq/sample_2_01_R2.fastq.gz \
	$(addsuffix .bwt,${REFERENCE}) \
	${bwa.exe} ${samtools.exe}
	mkdir -p $(dir $@) && \
	${bwa.exe} mem -R '@RG\tID:idp20681172\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
	${REFERENCE} \
	test/fastq/sample_2_01_R1.fastq.gz \
	test/fastq/sample_2_01_R2.fastq.gz |\
	${samtools.exe} view -uS - |\
	${samtools.exe} sort - $(basename $@)

$(addsuffix .fai,${REFERENCE}): ${REFERENCE} ${samtools.exe}
	${samtools.exe} faidx $<

$(addsuffix .bwt,${REFERENCE}): ${REFERENCE} ${bwa.exe}
	${bwa.exe} index $<

${BINDIR}/bwa-0.7.10/bwa :
	rm -rf $(BINDIR)/bwa-0.7.10/ && \
	mkdir -p $(BINDIR) && \
	curl -o $(BINDIR)/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
	tar xvfj $(BINDIR)/bwa-0.7.10.tar.bz2 -C $(OUTDIR)/bin && \
	rm $(BINDIR)/bwa-0.7.10.tar.bz2 && \
	make -C $(dir $@)

${BINDIR}/samtools-0.1.19/bcftools/bcftools: ${BINDIR}/samtools-0.1.19/samtools

${BINDIR}/samtools-0.1.19/samtools :
	rm -rf $(BINDIR)/samtools-0.1.19/ && \
	mkdir -p $(BINDIR) && \
	curl -o $(BINDIR)/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
	tar xvfj $(BINDIR)/samtools-0.1.19.tar.bz2 -C $(OUTDIR)/bin && \
	rm $(BINDIR)/samtools-0.1.19.tar.bz2 && \
	make -C $(dir $@)

${BINDIR}/tabix-0.2.6/bgzip : ${BINDIR}/tabix-0.2.6/tabix

${BINDIR}/tabix-0.2.6/tabix :
	rm -rf $(BINDIR)/tabix-0.2.6/ && \
	mkdir -p $(BINDIR) && \
	curl -o $(BINDIR)/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
	tar xvfj $(BINDIR)/tabix-0.2.6.tar.bz2 -C $(OUTDIR)/bin && \
	rm $(BINDIR)/tabix-0.2.6.tar.bz2 && \
	make -C $(dir $@) tabix bgzip

clean:
	rm -rf ${BINDIR}

Drawing the Workflow:
The workflow is drawn with https://github.com/lindenb/makefile2graph.
Running Make:
And here is the output of make:
rm -rf /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/ && \
mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
curl -o /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
tar xvfj /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -C OUT/bin && \
rm /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 && \
make -C /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/
bwa-0.7.10/
bwa-0.7.10/bamlite.c
(...)
gcc -g -Wall -Wno-unused-function -O2 -msse -msse2 -msse3 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o is.o bwtindex.o bwape.o kopen.o pemerge.o bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o bwtsw2_chain.o fastmap.o bwtsw2_pair.o main.o -o bwa -L. -lbwa -lm -lz -lpthread
make[1]: Entering directory `/home/lindenb/src/ngsxml'
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa index test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/ && \
mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
curl -o /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
tar xvfj /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -C OUT/bin && \
rm /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 && \
make -C /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/
samtools-0.1.19/
samtools-0.1.19/.gitignore
samtools-0.1.19/AUTHORS
(...)
gcc -g -Wall -O2 -o bamcheck bamcheck.o -L.. -lm -lbam -lpthread -lz
make[3]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/misc'
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19'
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11671724\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
test/ref/ref.fa \
test/fastq/sample_1_01_R1.fastq.gz \
test/fastq/sample_1_01_R2.fastq.gz |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
test/ref/ref.fa \
test/fastq/sample_1_02_R1.fastq.gz \
test/fastq/sample_1_02_R2.fastq.gz |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools merge -f OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
cp OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11673828\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
test/ref/ref.fa \
test/fastq/sample_2_01_R1.fastq.gz \
test/fastq/sample_2_01_R2.fastq.gz |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
cp OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools faidx test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ && \
mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
curl -o /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
tar xvfj /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -C OUT/bin && \
rm /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 && \
make -C /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ tabix bgzip
tabix-0.2.6/
tabix-0.2.6/ChangeLog
tabix-0.2.6/Makefile
(...)
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6'
mkdir -p OUT/Projects/Proj1/VCF/ && \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools mpileup -uf test/ref/ref.fa OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam | \
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/bcftools/bcftools view -vcg - > OUT/Projects/Proj1/VCF/Proj1.vcf && \
/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/bgzip -f OUT/Projects/Proj1/VCF/Proj1.vcf && \
/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/tabix -f -p vcf OUT/Projects/Proj1/VCF/Proj1.vcf.gz
make[1]: Leaving directory `/home/lindenb/src/ngsxml'

At the end, a VCF is generated:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
(...)
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample1	Sample2
chr4_gl000194_random	1973	.	C	G	14.4	.	DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,3,0;MQ=60;FQ=-32.3	GT:PL:GQ	1/1:31,6,0:11	1/1:17,3,0:8
chr4_gl000194_random	2462	.	A	T	14.4	.	DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3	GT:PL:GQ	1/1:31,6,0:11	1/1:17,3,0:8
chr4_gl000194_random	2492	.	G	C	14.4	.	DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3	GT:PL:GQ	1/1:31,6,0:11	1/1:17,3,0:8
chr4_gl000194_random	2504	.	A	T	14.4	.	DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3	GT:PL:GQ	1/1:31,6,0:11	1/1:17,3,0:8

That's it,
Pierre
1 comment:
Wow. That's pretty slick. Thanks for sharing!