Divide-and-conquer in a #Makefile : recursivity and #parallelism.
This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of 'N' VCFs files. You want to create a list of:
- common SNPs in vcf1 and vcf2
- common SNPs in vcf3 and the previous list
- common SNPs in vcf4 and the previous list
- (...)
- common SNPs in vcfN and the previous list
grep -v '^#' f.vcf|cut -f 1,2,4,5 | sort | uniq
Using a linear Makefile it could look like:
list2: vcf1 vcf2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list3: vcf3 list2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list4: vcf4 list3 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list5: vcf5 list4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ (...)
We can speed-up the workflow using the parallel option of make -j (number-of-parallel-jobs) and using a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently in parallel.
list1_2: vcf1 vcf2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list3_4: vcf3 vcf4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list1_4: list1_2 list3_4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
By using the internal 'Make' functions $(eval), $(call), $(shell) we can define a recursive method "recursive". This method takes two arguments which are the 0-based indexes of a VCF in the list of VCFs. Here is the Makefile:
Running the makefile:
$ make gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1 gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2 LC_ALL=C comm -12 target0_1 target1_2 > target0_2 gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3 gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4 gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5 LC_ALL=C comm -12 target3_4 target4_5 > target3_5 LC_ALL=C comm -12 target2_3 target3_5 > target2_5 LC_ALL=C comm -12 target0_2 target2_5 > target0_5 gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6 gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7 LC_ALL=C comm -12 target5_6 target6_7 > target5_7 gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8 gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9 gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10 LC_ALL=C comm -12 target8_9 target9_10 > target8_10 LC_ALL=C comm -12 target7_8 target8_10 > target7_10 LC_ALL=C comm -12 target5_7 target7_10 > target5_10 LC_ALL=C comm -12 target0_5 target5_10 > target0_10 gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11 gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12 LC_ALL=C comm -12 target10_11 target11_12 > target10_12 gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13 gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14 gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15 LC_ALL=C comm -12 target13_14 target14_15 > target13_15 LC_ALL=C comm -12 target12_13 target13_15 > target12_15 LC_ALL=C comm -12 target10_12 target12_15 > target10_15 gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16 gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17 LC_ALL=C comm -12 target15_16 target16_17 > target15_17 gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18 gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19 gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20 LC_ALL=C comm -12 target18_19 target19_20 > target18_20 LC_ALL=C comm -12 target17_18 target18_20 > target17_20 LC_ALL=C comm -12 target15_17 target17_20 > target15_20 LC_ALL=C comm -12 target10_15 target15_20 > target10_20 LC_ALL=C comm -12 target0_10 target10_20 > target0_20and here is the generated workflow (drawn with make2graph ).
:
That's it
Pierre.