In my previous post, I've shown how a the .INTERMEDIATE keyword can be used in a Makefile to reduce the number of temporary files generated in a classical NGS pipeline. 'Make' also has an option '-j' to specify the number of threads: some independent tasks (such as aligning two distinct samples) can be processed in parallel .
At this point, I should cite this tweet from Sean Davis:
— Sean Davis (@seandavis12) August 30, 2012
@yokofakun You might also take a look at distmake if you are on a cluster.Ruffus (python) is an option more bioinfo oriented than make
... and this even-more-recent tweet from Justin H. Johnson:
Snakemake - A scalable bioinformatics workflow engine. 1.usa.gov/NqPI9l
— Justin H. Johnson (@BioInfo) August 31, 2012#Bioinformatics
Here, I will show how XSLT can be used to generate a whole Makefile to process the output of Illumina/Casava after the FASTQs files have been generated using configureBclToFastq.pl using --fastq-cluster-count.
A Xml descriptor generated by Illumina Casava looks like this:
<DemultiplexConfig> <Software/> <FlowcellInfo ID="C0KTCACXX" Operator="JohnDoe" Recipe="" Desc=""> <Lane Number="1"> <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" /> <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" /> <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" /> <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane1" Desc="Clusters with unmatched barcodes for lane 1" Ref="unknown" /> </Lane> <Lane Number="2"> <Sample ProjectId="WGS" Control="N" Index="CTTGTA" SampleId="Father" Desc="1.5pM" Ref="" /> <Sample ProjectId="WGS" Control="N" Index="GCCAAT" SampleId="Mother" Desc="1.5pM" Ref="" /> <Sample ProjectId="WGS" Control="N" Index="TGACCA" SampleId="Daughter" Desc="1.5pM" Ref="" /> <Sample ProjectId="Undetermined_indices" Control="N" Index="Undetermined" SampleId="lane2" Desc="Clusters with unmatched barcodes for lane 2" Ref="unknown" /> </Lane> </FlowcellInfo> </DemultiplexConfig>This document contains the required informations to find the fastqs and to generate a Makefile. I wrote the following XSLT stylesheet:
This stylesheet transforms the DemultiplexConfig.xml file to the following Makefile:
# #files must have been generated using CASAVA/configureBclToFastq.pl with --fastq-cluster-count 0 #path to bwa override BWA=/path/to/bwa-0.6.1/bwa BWASAMPEFLAG= -a 600 BWAALNFLAG= -t 4 #path to samtools SAMTOOLS=/path/to/samtools #path to BEDTOOLS BEDTOOLS=/path/to/bedtools/bin #output directory OUTDIR=2012831.OUTPUT.idp0 INPUTDIR=. REF=/path/to/human_g1k_v37.fasta JAVA=java PICARDLIB=/path/to/picard-tools TABIX=/path/to/tabix GATK=/path/to/GenomeAnalysisTK.jar GATKFLAGS= VCFDBSNP=/path/to/ncbi/snp/00-All.vcf.gz #rule how to create a bai from a bam: %.bam.bai: %.bam #indexing BAM $< $(SAMTOOLS) index $< #creates the default output directory $(OUTDIR): mkdir -p $@ #indexes the reference for bwa $(REF).bwt $(REF).ann $(REF).amb $(REF).pac $(REF).sa : $(REF) $(BWA) index -a bwtsw $< #indexes the reference for samtools $(REF).fai: $(REF) $(SAMTOOLS) faidx $< #creates sequence dict for picard $(REF).dict: $(REF) $(JAVA) -jar $(PICARDLIB)/CreateSequenceDictionary.jar REFERENCE=$(REF) OUTPUT=$< #intermediate see http://plindenbaum.blogspot.fr/2012/08/next-generation-sequencing-gnu-make-and.html .INTERMEDIATE : \ $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai \ $(OUTDIR)/Project_WGS/Sample_Father/Father.bam \ $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam #main target : generate the VCFs all : vcfs vcfs: \ $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Father/Father.bam $(OUTDIR)/Project_WGS/Sample_Father/Father.bam.bai $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T RealignerTargetCreator \ -R $(REF) \ -I $< \ -o $<.intervals \ --known $(VCFDBSNP) $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T IndelRealigner \ -R $(REF) \ -I $< \ -o $@ \ -targetIntervals $<.intervals \ --knownAlleles $(VCFDBSNP) rm -f $<.intervals #creates a BAM for 'Father' by merging the other lanes $(OUTDIR)/Project_WGS/Sample_Father/Father.bam: $(OUTDIR) \ $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam #merge BAMS $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \ INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam \ INPUT=$(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam #fix mate information $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \ INPUT=$@ #validate $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \ VALIDATE_INDEX=true \ I=$@ #Creates a BAM for 'Project_WGS/Sample_Father/Father_CTTGTA_L001' $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Father #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp22528_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idp22528 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idp22528.sam \ $(OUTDIR)/idp22528_1.sai \ $(OUTDIR)/idp22528_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L001_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idp22528_1.sai $(OUTDIR)/idp22528_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528.sam #remove the sam rm -f $(OUTDIR)/idp22528.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idp22528_unsorted.bam $(OUTDIR)/idp22528_sorted #remove the unsorted bam rm -f $(OUTDIR)/idp22528_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idp22528_sorted.bam \ OUTPUT=$(OUTDIR)/idp22528_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics #remove before duplicate rm -f $(OUTDIR)/idp22528_sorted.bam #move to final bam mv $(OUTDIR)/idp22528_dup.bam $@ $(OUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Father #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_1.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp31344_2.sai $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idp31344 CN:MyLaboratory LB:Father PG:bwa SM:Father PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idp31344.sam \ $(OUTDIR)/idp31344_1.sai \ $(OUTDIR)/idp31344_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Father/Father_CTTGTA_L002_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idp31344_1.sai $(OUTDIR)/idp31344_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344.sam #remove the sam rm -f $(OUTDIR)/idp31344.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idp31344_unsorted.bam $(OUTDIR)/idp31344_sorted #remove the unsorted bam rm -f $(OUTDIR)/idp31344_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idp31344_sorted.bam \ OUTPUT=$(OUTDIR)/idp31344_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Father/Father.metrics #remove before duplicate rm -f $(OUTDIR)/idp31344_sorted.bam #move to final bam mv $(OUTDIR)/idp31344_dup.bam $@ #creates a VCF for 'Father' $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Father/Father.realigned.bam $(JAVA) -jar $(GATK) \ -T UnifiedGenotyper \ -glm BOTH \ -baq OFF \ -R $(REF) \ -I $< \ --dbsnp $(VCFDBSNP) \ -o $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Father/Father.vcf.gz $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam.bai $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T RealignerTargetCreator \ -R $(REF) \ -I $< \ -o $<.intervals \ --known $(VCFDBSNP) $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T IndelRealigner \ -R $(REF) \ -I $< \ -o $@ \ -targetIntervals $<.intervals \ --knownAlleles $(VCFDBSNP) rm -f $<.intervals #creates a BAM for 'Mother' by merging the other lanes $(OUTDIR)/Project_WGS/Sample_Mother/Mother.bam: $(OUTDIR) \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam #merge BAMS $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \ INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam \ INPUT=$(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam #fix mate information $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \ INPUT=$@ #validate $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \ VALIDATE_INDEX=true \ I=$@ #Creates a BAM for 'Project_WGS/Sample_Mother/Mother_GCCAAT_L001' $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp15352_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idp15352 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idp15352.sam \ $(OUTDIR)/idp15352_1.sai \ $(OUTDIR)/idp15352_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L001_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idp15352_1.sai $(OUTDIR)/idp15352_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352.sam #remove the sam rm -f $(OUTDIR)/idp15352.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idp15352_unsorted.bam $(OUTDIR)/idp15352_sorted #remove the unsorted bam rm -f $(OUTDIR)/idp15352_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idp15352_sorted.bam \ OUTPUT=$(OUTDIR)/idp15352_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics #remove before duplicate rm -f $(OUTDIR)/idp15352_sorted.bam #move to final bam mv $(OUTDIR)/idp15352_dup.bam $@ $(OUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Mother #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_1.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idm22432_2.sai $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idm22432 CN:MyLaboratory LB:Mother PG:bwa SM:Mother PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idm22432.sam \ $(OUTDIR)/idm22432_1.sai \ $(OUTDIR)/idm22432_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Mother/Mother_GCCAAT_L002_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idm22432_1.sai $(OUTDIR)/idm22432_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432.sam #remove the sam rm -f $(OUTDIR)/idm22432.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idm22432_unsorted.bam $(OUTDIR)/idm22432_sorted #remove the unsorted bam rm -f $(OUTDIR)/idm22432_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idm22432_sorted.bam \ OUTPUT=$(OUTDIR)/idm22432_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Mother/Mother.metrics #remove before duplicate rm -f $(OUTDIR)/idm22432_sorted.bam #move to final bam mv $(OUTDIR)/idm22432_dup.bam $@ #creates a VCF for 'Mother' $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Mother/Mother.realigned.bam $(JAVA) -jar $(GATK) \ -T UnifiedGenotyper \ -glm BOTH \ -baq OFF \ -R $(REF) \ -I $< \ --dbsnp $(VCFDBSNP) \ -o $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Mother/Mother.vcf.gz $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam.bai $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T RealignerTargetCreator \ -R $(REF) \ -I $< \ -o $<.intervals \ --known $(VCFDBSNP) $(JAVA) -jar $(GATK) $(GATKFLAGS) \ -T IndelRealigner \ -R $(REF) \ -I $< \ -o $@ \ -targetIntervals $<.intervals \ --knownAlleles $(VCFDBSNP) rm -f $<.intervals #creates a BAM for 'Daughter' by merging the other lanes $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.bam: $(OUTDIR) \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam #merge BAMS $(JAVA) -jar $(PICARDLIB)/MergeSamFiles.jar SORT_ORDER=coordinate OUTPUT=$@ \ INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam \ INPUT=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam #fix mate information $(JAVA) -jar $(PICARDLIB)/FixMateInformation.jar \ INPUT=$@ #validate $(JAVA) -jar $(PICARDLIB)/ValidateSamFile.jar \ VALIDATE_INDEX=true \ I=$@ #Creates a BAM for 'Project_WGS/Sample_Daughter/Daughter_TGACCA_L001' $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp5176_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idp5176 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idp5176.sam \ $(OUTDIR)/idp5176_1.sai \ $(OUTDIR)/idp5176_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L001_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idp5176_1.sai $(OUTDIR)/idp5176_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176.sam #remove the sam rm -f $(OUTDIR)/idp5176.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idp5176_unsorted.bam $(OUTDIR)/idp5176_sorted #remove the unsorted bam rm -f $(OUTDIR)/idp5176_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idp5176_sorted.bam \ OUTPUT=$(OUTDIR)/idp5176_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics #remove before duplicate rm -f $(OUTDIR)/idp5176_sorted.bam #move to final bam mv $(OUTDIR)/idp5176_dup.bam $@ $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002.bam: $(REF).bwt \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz #create directory mkdir -p $(OUTDIR)/Project_WGS/Sample_Daughter #align fastq1 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_1.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz #align fastq2 with BWA $(BWA) aln $(BWAALNFLAG) $(REF) -f $(OUTDIR)/idp211744_2.sai $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz #sampe those files $(BWA) sampe $(BWASAMPEFLAG) -r '@RG ID:idp211744 CN:MyLaboratory LB:Daughter PG:bwa SM:Daughter PL:ILLUMINA' $(REF) \ -f $(OUTDIR)/idp211744.sam \ $(OUTDIR)/idp211744_1.sai \ $(OUTDIR)/idp211744_2.sai \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R1_001.fastq.gz \ $(INPUTDIR)/Project_WGS/Sample_Daughter/Daughter_TGACCA_L002_R2_001.fastq.gz #remove sai rm -f $(OUTDIR)/idp211744_1.sai $(OUTDIR)/idp211744_2.sai #convert sam to bam $(SAMTOOLS) view -b -T $(REF) -o $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744.sam #remove the sam rm -f $(OUTDIR)/idp211744.sam #sort this bam $(SAMTOOLS) sort $(OUTDIR)/idp211744_unsorted.bam $(OUTDIR)/idp211744_sorted #remove the unsorted bam rm -f $(OUTDIR)/idp211744_unsorted.bam #mark duplicates $(JAVA) -jar $(PICARDLIB)/MarkDuplicates.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true TMP_DIR=$(OUTDIR) \ INPUT=$(OUTDIR)/idp211744_sorted.bam \ OUTPUT=$(OUTDIR)/idp211744_dup.bam \ METRICS_FILE=$(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.metrics #remove before duplicate rm -f $(OUTDIR)/idp211744_sorted.bam #move to final bam mv $(OUTDIR)/idp211744_dup.bam $@ #creates a VCF for 'Daughter' $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz : $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.realigned.bam $(JAVA) -jar $(GATK) \ -T UnifiedGenotyper \ -glm BOTH \ -baq OFF \ -R $(REF) \ -I $< \ --dbsnp $(VCFDBSNP) \ -o $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf $(TABIX)/bgzip $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf $(TABIX)/tabix -f -p vcf $(OUTDIR)/Project_WGS/Sample_Daughter/Daughter.vcf.gz(Of course, this workflow would satisfy my needs, may be not yours)
Et voila !
That's it !
Pierre
I've created a tool that just infers the CASAVA file structure and creates the Makefile from that. I like make + SGE it's a nice and simple solution. Hierarchical makefiles also makes it dead easy to re-run parts of the analysis.
ReplyDelete/Lars