Divide-and-conquer in a #Makefile : recursivity and #parallelism.
This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of 'N' VCFs files. You want to create a list of:
- common SNPs in vcf1 and vcf2
- common SNPs in vcf3 and the previous list
- common SNPs in vcf4 and the previous list
- (...)
- common SNPs in vcfN and the previous list
grep -v '^#' f.vcf|cut -f 1,2,4,5 | sort | uniq
Using a linear Makefile it could look like:
list2: vcf1 vcf2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list3: vcf3 list2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list4: vcf4 list3 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list5: vcf5 list4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ (...)
We can speed-up the workflow using the parallel option of make -j (number-of-parallel-jobs) and using a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently in parallel.
list1_2: vcf1 vcf2 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list3_4: vcf3 vcf4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@ list1_4: list1_2 list3_4 grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
By using the internal 'Make' functions $(eval), $(call), $(shell) we can define a recursive method "recursive". This method takes two arguments which are the 0-based indexes of a VCF in the list of VCFs. Here is the Makefile:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
VCFFILES=Sample18.vcf.gz \ | |
Sample13.vcf.gz \ | |
Sample1.vcf.gz \ | |
Sample19.vcf.gz \ | |
Sample12.vcf.gz \ | |
Sample17.vcf.gz \ | |
Sample16.vcf.gz \ | |
Sample9.vcf.gz \ | |
Sample15.vcf.gz \ | |
Sample5.vcf.gz \ | |
Sample14.vcf.gz \ | |
Sample3.vcf.gz \ | |
Sample11.vcf.gz \ | |
Sample2.vcf.gz \ | |
Sample6.vcf.gz \ | |
Sample20.vcf.gz \ | |
Sample10.vcf.gz \ | |
Sample4.vcf.gz \ | |
Sample8.vcf.gz \ | |
Sample7.vcf.gz | |
define recursive | |
ifeq ($(shell echo "$(2)-$(1)" | bc),0) | |
else ifeq ($(shell echo "$(2)-$(1)" | bc),1) | |
target$(1)_$(2): $(word $(shell echo "$(1)+1" | bc),${VCFFILES}) | |
gunzip -c $$< | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > $$@ | |
else | |
target$(1)_$(2): target$(1)_$(shell echo "$(1) + ($(2)-$(1)) /2 " | bc) target$(shell echo "$(1) + ($(2)-$(1)) /2 " | bc)_$(2) | |
LC_ALL=C comm -12 $$^ > $$@ | |
$$(eval $$(call recursive,$(1),$(shell echo "$(1) + ($(2)-$(1)) /2 " | bc))) | |
$$(eval $$(call recursive,$(shell echo "$(1) + ($(2)-$(1)) /2 " | bc),$(2))) | |
endif | |
endef | |
all: target0_$(words ${VCFFILES}) | |
$(eval $(call recursive,0,$(words ${VCFFILES}))) | |
$ make gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1 gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2 LC_ALL=C comm -12 target0_1 target1_2 > target0_2 gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3 gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4 gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5 LC_ALL=C comm -12 target3_4 target4_5 > target3_5 LC_ALL=C comm -12 target2_3 target3_5 > target2_5 LC_ALL=C comm -12 target0_2 target2_5 > target0_5 gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6 gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7 LC_ALL=C comm -12 target5_6 target6_7 > target5_7 gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8 gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9 gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10 LC_ALL=C comm -12 target8_9 target9_10 > target8_10 LC_ALL=C comm -12 target7_8 target8_10 > target7_10 LC_ALL=C comm -12 target5_7 target7_10 > target5_10 LC_ALL=C comm -12 target0_5 target5_10 > target0_10 gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11 gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12 LC_ALL=C comm -12 target10_11 target11_12 > target10_12 gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13 gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14 gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15 LC_ALL=C comm -12 target13_14 target14_15 > target13_15 LC_ALL=C comm -12 target12_13 target13_15 > target12_15 LC_ALL=C comm -12 target10_12 target12_15 > target10_15 gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16 gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17 LC_ALL=C comm -12 target15_16 target16_17 > target15_17 gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18 gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19 gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20 LC_ALL=C comm -12 target18_19 target19_20 > target18_20 LC_ALL=C comm -12 target17_18 target18_20 > target17_20 LC_ALL=C comm -12 target15_17 target17_20 > target15_20 LC_ALL=C comm -12 target10_15 target15_20 > target10_20 LC_ALL=C comm -12 target0_10 target10_20 > target0_20and here is the generated workflow (drawn with make2graph ).
:

That's it
Pierre.
No comments:
Post a Comment