26 May 2013

Filtering a VCF with javascript

This is my answer for that question on biostar. I wrote a java program filtering the VCF with the rhino javascript-engine.
I put the code on github: see https://github.com/lindenb/jvarkit#-filtering-vcf-with-javascript-rhino-.

For each variation, the script binds the following variables:

  • variant : the current variation; a org.broadinstitute.variant.variantcontext.VariantContext ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/variantcontext/VariantContext.java )
  • header : the VCF header org.broadinstitute.variant.vcf.VCFHeader ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/vcf/VCFHeader.java).
and evaluate the user script. This user script should return '1' or true if the current VCF file should be printed.

For example, you want to keep the variants having at least two samples having a depth (DP) greater that 200.
The script would be:
function myfilterFunction()
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;
    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp < 200 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction();

Example:

curl -s "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" |\
java -jar  dist/vcffilterjs.jar  -f filter.js |\
grep -v "##"


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BLANK NA12878 NA12891 NA12892 NA19238 NA19239 NA19240
chr22 42526449 . T A 151.47 . AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611 GT:AD:DP:GQ:PL 0/1:23,8:31:99:190,0,694 0/0:188,0:190:99:0,478,5376 0/0:187,0:187:99:0,493,5322 0/0:247,0:249:99:0,634,6728 0/0:185,0:185:99:0,487,5515 0/0:202,0:202:99:0,520,5857 0/0:181,1:182:99:0,440,5362
chr22 42526634 . T C 32.60 . AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276 GT:AD:DP:GQ:PL 0/1:21,4:25:71:71,0,702 0/0:187,2:189:99:0,481,6080 0/0:233,0:233:99:0,667,7351 0/0:230,0:230:99:0,667,7394 0/0:174,1:175:99:0,446,5469 0/0:194,2:196:99:0,498,6239 0/0:174,0:175:99:0,511,5894
chr22 42527793 rs1080989 C T 3454.66 . AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618 GT:AD:DP:GQ:PL ./. 0/1:72,90:162:99:1699,0,1767 0/1:103,96:202:99:1756,0,2532 0/0:188,0:188:99:0,526,5889 0/0:160,0:160:99:0,457,4983 0/0:197,0:198:99:0,544,6100 0/0:156,0:156:99:0,439,5041

That's it,

Pierre

1 comment:

Unknown said...

This is similar to functionality that I included as part of varitas: either with the '-f' option that is provided a file containing a filter function which is wrapped in an iterator or the '-j' option which evaluates a boolean javascript expression to determine whether a variant should pass filtering.

One thing that reduces verbosity of the code for a boolean filter is to wrap the provided javascript in a 'with (variant) { }' block so that the 'variant' object is implicit.

I had started this when the VCF/VariantContext code had moved out of tribble and into GATK but not yet into picard, so to keep dependencies down I handle VCF parsing with custom code, but would consider moving to the code now in picard to be more consistent with other projects (and avoid having to maintain VCF parsing code myself).