
22 September 2016

Writing a Custom ReadFilter for the GATK, my notebook.


The GATK contains a set of predefined read filters that "filter or transform incoming SAM/BAM data files".

With the help of the modular architecture of the GATK, it's possible to write a custom ReadFilter. In this post I'll write a ReadFilter that removes reads containing a specific sequence in a soft-clipped segment.
Read filters extend the class org.broadinstitute.gatk.engine.filters.ReadFilter, which implements htsjdk's SamRecordFilter (the single-record method boolean filterOut(SAMRecord record) comes from that interface):
public abstract class ReadFilter implements SamRecordFilter {
    public void initialize(GenomeAnalysisEngine engine) {}
    public boolean filterOut(final SAMRecord first, final SAMRecord second) {
        throw new UnsupportedOperationException("Paired filter not implemented: " + this.getClass());
    }
}
Thus the implementation of our filter, named 'My', is defined below:
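A minimal sketch of what MyFilter.java can look like (the original source was embedded as a gist; the 'clippattern' argument mirrors the invocation used later in this post, while the @Argument wiring and the CIGAR walk are a plausible reconstruction, not necessarily the original code):

package org.broadinstitute.gatk.engine.filters;

import htsjdk.samtools.CigarElement;
import htsjdk.samtools.CigarOperator;
import htsjdk.samtools.SAMRecord;
import org.broadinstitute.gatk.utils.commandline.Argument;

/** Filters out reads containing a given sequence in a soft-clipped segment. */
public class MyFilter extends ReadFilter {
    @Argument(fullName="clippattern", shortName="clippattern",
              doc="sequence searched in the soft-clipped segments", required=true)
    public String clipPattern = null;

    @Override
    public boolean filterOut(final SAMRecord rec) {
        if (rec.getReadUnmappedFlag() || rec.getCigar() == null) return false;
        final String bases = rec.getReadString();
        int readPos = 0; // current offset in the read sequence
        for (final CigarElement ce : rec.getCigar().getCigarElements()) {
            // soft-clipped segment: does it contain the pattern ?
            if (ce.getOperator() == CigarOperator.S &&
                bases.substring(readPos, readPos + ce.getLength()).contains(this.clipPattern)) {
                return true; // reject this read
            }
            if (ce.getOperator().consumesReadBases()) readPos += ce.getLength();
        }
        return false; // keep this read
    }
}

Note that the GATK derives the filter name 'My' from the class name MyFilter by stripping the 'Filter' suffix.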
The class is compiled and packaged using the GATK itself as a library:
mkdir -p  org/broadinstitute/gatk/engine/filters
cp MyFilter.java org/broadinstitute/gatk/engine/filters
javac -cp /path/to/GenomeAnalysisTK.jar org/broadinstitute/gatk/engine/filters/MyFilter.java
jar cvf myfilter.jar org
rm -rf org
The GATK main class is org.broadinstitute.gatk.engine.CommandLineGATK. We can now check that a filter named 'My' is present in the list of read filters, for example by passing a deliberately invalid filter name:
$ java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar  org.broadinstitute.gatk.engine.CommandLineGATK  \
  -T ReadLengthDistribution \
  --read_filter generate_and_error_please
(...)
##### ERROR MESSAGE: Invalid command line: Malformed read filter: Read filter generate_and_error_please not found. Available read filters:
##### ERROR 
##### ERROR                                 FilterName        Documentation
##### ERROR                           MissingReadGroup        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MissingReadGroupFilter.php
##### ERROR                                         My        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MyFilter.php
##### ERROR                    NoOriginalQualityScores        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_NoOriginalQualityScoresFilter.php
We can now use the GATK tools with or without the 'My' filter.
java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.ATGATA.tbl \
  --read_filter My --clippattern ATGATA 
(...)
INFO  10:53:32,412 MicroScheduler - 32 reads were filtered out during the traversal out of approximately 12434 total reads (0.26%) 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  10:53:32,413 MicroScheduler -   -> 32 reads (0.26% of total) failing MyFilter

$ cat dist.ATGATA.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12



while without 'My' the output is:

java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.all.tbl
(...)
INFO  10:53:35,713 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12434 total reads (0.00%) 
INFO  10:53:35,713 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:35,714 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

$ cat  dist.all.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12

That's it,
Pierre

09 September 2016

Playing with #magicblast, the #NCBI Short read mapper. My notebook

NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.



Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).

First, here is an extract of the manual for magicblast.
USAGE
DESCRIPTION
   Short read mapper

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
    * Incompatible with:  sra
 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
   Input format for sequences
   Default = `fasta'
    * Incompatible with:  sra
 -paired
   Input query sequences are paired
 -query_mate <File_In>
   FASTA file with mates for query sequences (if given in another file)
    * Requires:  query
 -sra <String>
   Comma-separated SRA accessions
    * Incompatible with:  query, infmt

 *** Formatting options
 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
   alignment view options:
   sam = SAM format,
   tabular = Tabular format,
   text ASN.1
   Default = `sam'

Indexing the reference for magicblast

I've indexed the human chr22 using the good old NCBI makeblastdb:
$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22

Mapping Paired-End reads with magicblast

First, I've removed the read name suffixes '/1' and '/2' because they would otherwise appear in the final BAM.
gunzip -c R1.aa.fq.gz | sed 's/\/1$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's/\/2$//' | gzip > R2.fq.gz
The reads are then mapped with magicblast using the following command:
magicblast -db chr22 \
 -infmt fastq -paired \
 -query R1.fq.gz \
 -query_mate R2.fq.gz |\
 sed 's/gnl|BL_ORD_ID|0/chr22/g' |\
 samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM
As far as I could see, there was no option to specify the read group (RG).
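If a read group is really needed, one workaround would be to add it after mapping; here is a hedged htsjdk sketch (the read-group id 'grp1' and the sample name 'child' are arbitrary placeholders; this is not part of the pipeline used below):

import htsjdk.samtools.*;
import java.io.File;

public class AddReadGroup {
    public static void main(final String[] args) throws Exception {
        final SamReader in = SamReaderFactory.makeDefault().open(new File(args[0]));
        final SAMFileHeader header = in.getFileHeader();
        final SAMReadGroupRecord rg = new SAMReadGroupRecord("grp1"); // placeholder RG id
        rg.setSample("child"); // placeholder sample name
        header.addReadGroup(rg);
        try (final SAMFileWriter out = new SAMFileWriterFactory()
                .makeBAMWriter(header, true, new File(args[1]))) {
            for (final SAMRecord rec : in) {
                rec.setAttribute(SAMTag.RG.name(), rg.getId()); // tag each record
                out.addAlignment(rec);
            }
        }
        in.close();
    }
}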

Here, I've transformed the name of the contig with sed because magicblast writes it as a BLAST identifier (gnl|BL_ORD_ID|0).
I've also mapped the reads using bwa:

bwa mem  chr22.fa R1.fq.gz R2.fq.gz |\
 samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM

BAM headers

Here is the BAM header for magicblast: magicblast uses SAM spec v1.2 and sets a group-order flag (GO:query).

@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz

And the BAM header for bwa:
@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz

Samtools flagstat

for magicblast:
24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)

for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Validating with picard

Validation of child.magic.bam with picard generates many errors:
$ java -jar picard.jar  ValidateSamFile  I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)
while there are almost no errors for child.bwa.bam:
$ java -jar picard.jar  ValidateSamFile  I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200



Variant Calling

Variants are called on both BAMs with samtools/bcftools (min DP=10):
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.bwa.vcf -
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.magic.vcf -
Number of variants:
$ grep -v "#" child.magic.vcf  | wc -l
111
$ grep -v "#" child.bwa.vcf  | wc -l
166

Here is the diff for CHROM/POS/REF/ALT:
  • Uniq to BWA: 'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 64 variants
  • Uniq to Magic: 'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 9 variants
  • Common variants: 'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 102 variants

The Makefile

SHELL=/bin/bash
data.dir=${HOME}/src/DATA
ncbi.bin=${HOME}/packages/magicblast/bin
REF=chr22.fa
samtools.exe=${HOME}/packages/samtools/samtools 
bwa.exe=${HOME}/packages/bwa-0.7.15/bwa 

all: child.magic.bam child.bwa.bam

R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz 
 gunzip -c $< | sed 's/\/1$$//' | gzip > $@
R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz 
 gunzip -c $< | sed 's/\/2$$//' | gzip > $@

child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz 
 ${bwa.exe} mem  ${REF} $(word 2,$^) $(word 3,$^) |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 


child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz 
 ${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\
 sed 's/gnl|BL_ORD_ID|0/$(basename ${REF})/g' |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 

${REF}.bwt : ${REF}
 ${bwa.exe} index $<

${REF}.nin : ${REF}
 ${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF})

${REF} :
 wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz"
 gunzip -f $@.gz

That's it,
Pierre

17 May 2016

finding new intron-exon junctions using the public Encode RNASeq data

I've been asked to look for some new / suspected / previously uncharacterized intron-exon junctions in public RNASeq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.

The following command is used to build the list of BAMs:

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' 

wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \
(...)

This list is inserted as a variable named SAMPLES in a Makefile.

For each BAM, we use samtools to retrieve the reads in the region(s) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to only keep the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read and computes the genomic interval that is skipped when the cigar operator is a deletion or a skipped region/intron. The read is printed if it describes the new intron-exon junction.
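The actual filter is a samjs javascript expression; as an illustration, the same CIGAR walk can be sketched in Java with htsjdk (the class name, arguments and single-junction test are placeholders, not the original script):

import htsjdk.samtools.*;
import java.io.File;

public class JunctionFilter {
    public static void main(final String[] args) throws Exception {
        final int junction = Integer.parseInt(args[1]); // 1-based position of the candidate junction
        try (final SamReader reader = SamReaderFactory.makeDefault().open(new File(args[0]))) {
            for (final SAMRecord rec : reader) {
                if (rec.getReadUnmappedFlag() || rec.getCigar() == null) continue;
                int refPos = rec.getAlignmentStart();
                for (final CigarElement ce : rec.getCigar().getCigarElements()) {
                    final CigarOperator op = ce.getOperator();
                    // a deletion (D) or skipped region/intron (N) spans a genomic gap
                    if ((op == CigarOperator.D || op == CigarOperator.N) &&
                        (refPos == junction || refPos + ce.getLength() == junction)) {
                        System.out.print(rec.getSAMString()); // getSAMString() ends with a newline
                        break;
                    }
                    if (op.consumesReferenceBases()) refPos += ce.getLength();
                }
            }
        }
    }
}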





That's it,

Pierre



04 March 2016

Now in picard: two javascript-based tools filtering BAM and VCF files.


SamJS and VCFFilterJS are two tools I wrote for jvarkit. Both tools use the embedded java javascript engine to filter BAM or VCF files.
To get a broader audience, I've copied those functionalities to Picard in 'FilterSamReads' and 'FilterVcf'.

FilterSamReads

FilterSamReads filters a SAM or BAM file with a javascript expression using the java javascript-engine.
The script puts the following variables in the script context: 'record', a SAMRecord, and 'header', a SAMFileHeader. The last value of the script should be a boolean telling whether we should accept or reject the record.

The script samFilter.js

/** accept record if the second base of the read is an A */
function accept(r)
 {
 return r.getReadString().length()>2 &&
  r.getReadString().substring(1,2)=="A";
 }

accept(record);

Invoke and output

$ java -jar picard-tools-2.1.1/picard.jar \
 FilterSamReads I=in.sam  O=out.sam \
 JS=samFilter.js FILTER=includeJavascript
 
$ cat out.sam | cut -f 10 | cut -c 2 | sort | uniq

A

FilterVcf

FilterVcf applies one or more hard filters to a VCF file to filter out genotypes and variants.
It filters the VCF with a javascript expression interpreted by the java javascript engine. The script puts the following variables in the script context: 'variant', a VariantContext, and 'header', a VCFHeader. The last value of the script should be a boolean telling whether we should accept or reject the record.

The script variantFilter.js

/** accept the variant if more than two samples have a DP>100 */
function myfilterFunction(thevariant)
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;

    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! thevariant.hasGenotype(sampleName)) continue;
        var genotype = thevariant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp > 100 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction(variant)

Invoke and output

java -jar picard-tools-2.1.1/picard.jar FilterVcf \
 I=in.vcf O=out.vcf \
 JS=variantFilter.js
 
$ grep -v '#' out.vcf | cut -f 7 | grep variantFilter | wc -l
23


That's it,
Pierre

24 February 2016

Registering a tool in the @ELIXIREurope registry using XML, XSLT, JSON and curl. My notebook.

The Elixir Registry (pmid:26538599) is "a portal to bioinformatics resources world-wide. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools."
In this post, I will describe how I've used the bio.tools API to register some tools from jvarkit.

Authenticate with your credentials

Using curl, the 'bio.tools' service returns an authentication token.

$ curl -s \
 -H "Content-type: application/json" \
 -X POST \
 -d '{"username":"my-login@univ-nantes.fr","password":"password1234"}' \
 https://bio.tools/api/auth/login |\
 python -m json.tool
{
    "token": "74dedea023dbad8ecda49ac57bb1074acd794f"
}

Creating a JSON describing the tool.

The tool I'm going to use is VcfHead, a very simple tool printing the first variants of a VCF file. In jvarkit I don't write the code parsing the arguments: everything is described in an XML file that is processed with an XSLT stylesheet to generate an abstract java class handling the options, etc.

xsltproc command2java VcfHead.xml > AbstractVcfHead.java

For VcfHead the XML descriptor is available here: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfHead.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app xmlns="http://github.com/lindenb/jvarkit/" xmlns:h="http://www.w3.org/1999/xhtml" app="VcfHead" package="com.github.lindenb.jvarkit.tools.misc" >
  <description>Print the first variants of a VCF.</description>
  <input type="vcf"/>
  <output type="vcf"/>
  <options>
    <option name="count" type="int" opt="n" longopt="count" min-inclusive="0" default="10">
      <description>number of variants to be printed</description>
    </option>
  </options>
  <documentation>
    <h:h3>Example</h:h3>
   (...)
  </documentation>
</app>

Using a first XSLT stylesheet, https://github.com/lindenb/jvarkit/blob/master/src/main/resources/xsl/jsonxelixir.xsl, 'VcfHead.xml' is first converted to the 'infamous' JSONx (JSON+XML) format.
xsltproc jsonxelixir VcfHead.xml > VcfHead.jsonx
The JSONx file:
<?xml version="1.0"?>
<jsonx:object xmlns:jsonx="http://www.ibm.com/xmlns/prod/2009/jsonx" xmlns:c="http://github.com/lindenb/jvarkit/" xmlns="http://www.w3.org/1999/xhtml" xmlns:x="http://www.ibm.com/xmlns/prod/2009/jsonx">
  <jsonx:string name="accessibility">Public</jsonx:string>
  <jsonx:string name="affiliation">univ-nantes.fr</jsonx:string>
  <jsonx:string name="cost">Free</jsonx:string>
  <jsonx:array name="platform">
    <jsonx:string>Linux</jsonx:string>
    <jsonx:string>Mac</jsonx:string>
  </jsonx:array>
  <jsonx:string name="version">1.0</jsonx:string>
  <jsonx:string name="homepage">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
  <jsonx:array name="function">
    <jsonx:object>
      <jsonx:array name="input">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="output">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:string name="dataDescription">any format</jsonx:string>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="functionName">
        <jsonx:object>
          <jsonx:string name="term">Formatting</jsonx:string>
          <jsonx:string name="uri">http://edamontology.org/operation_0335</jsonx:string>
        </jsonx:object>
      </jsonx:array>
      <jsonx:string name="functionDescription">Print the first variants of a VCF.</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="description">Print the first variants of a VCF.</jsonx:string>
  <jsonx:object name="docs">
    <jsonx:string name="docsTermsOfUse">https://opensource.org/licenses/MIT</jsonx:string>
    <jsonx:string name="docsGithub">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsHome">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsCitationInstructions">https://github.com/lindenb/jvarkit/wiki/Citing</jsonx:string>
    <jsonx:string name="docsDownloadSource">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
    <jsonx:string name="docsDownload">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
  </jsonx:object>
  <jsonx:array name="collection">
    <jsonx:string>jvarkit</jsonx:string>
  </jsonx:array>
  <jsonx:object name="credits">
    <jsonx:array name="creditsInstitution">
      <jsonx:string>Institut du Thorax, Nantes, France</jsonx:string>
    </jsonx:array>
    <jsonx:array name="creditsDeveloper">
      <jsonx:string>Pierre Lindenbaum</jsonx:string>
    </jsonx:array>
  </jsonx:object>
  <jsonx:array name="interface">
    <jsonx:object>
      <jsonx:string name="interfaceType">Command line</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="name">VcfHead</jsonx:string>
  <jsonx:array name="topic">
    <jsonx:object>
      <jsonx:string name="term">Omics</jsonx:string>
      <jsonx:string name="uri">http://edamontology.org/topic_3391</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="license">MIT License</jsonx:string>
  <jsonx:array name="language">
    <jsonx:string>Java</jsonx:string>
  </jsonx:array>
  <jsonx:array name="resourceType">
    <jsonx:string>Tool</jsonx:string>
  </jsonx:array>
  <jsonx:string name="maturity">Stable</jsonx:string>
  <jsonx:array name="contact">
    <jsonx:object>
      <jsonx:string name="contactURL">https://github.com/lindenb</jsonx:string>
      <jsonx:string name="contactName">Pierre Lindenbaum</jsonx:string>
      <jsonx:array name="contactRole">
        <jsonx:string>Developer</jsonx:string>
        <jsonx:string>Maintainer</jsonx:string>
        <jsonx:string>Helpdesk</jsonx:string>
      </jsonx:array>
    </jsonx:object>
  </jsonx:array>
  <jsonx:object name="publications">
    <jsonx:string name="publicationsPrimaryID">doi:10.6084/m9.figshare.1425030.v1</jsonx:string>
  </jsonx:object>
</jsonx:object>

Using another XSLT stylesheet jsonx2json.xsl, the JSONx is converted to a JSON file.
xsltproc jsonx2json.xsl VcfHead.jsonx > VcfHead.json
the JSON file:
{
    "accessibility": "Public", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
        }
    ], 
    "cost": "Free", 
    "credits": {
        "creditsDeveloper": [
            "Pierre Lindenbaum"
        ], 
        "creditsInstitution": [
            "Institut du Thorax, Nantes, France"
        ]
    }, 
    "description": "Print the first variants of a VCF.", 
    "docs": {
        "docsCitationInstructions": "https://github.com/lindenb/jvarkit/wiki/Citing", 
        "docsDownload": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsDownloadSource": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsGithub": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsHome": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsTermsOfUse": "https://opensource.org/licenses/MIT"
    }, 
    "function": [
        {
            "functionDescription": "Print the first variants of a VCF.", 
            "functionName": [
                {
                    "term": "Formatting", 
                    "uri": "http://edamontology.org/operation_0335"
                }
            ], 
            "input": [
                {
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ], 
            "output": [
                {
                    "dataDescription": "any format", 
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ]
        }
    ], 
    "homepage": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
    "interface": [
        {
            "interfaceType": "Command line"
        }
    ], 
    "language": [
        "Java"
    ], 
    "license": "MIT License", 
    "maturity": "Stable", 
    "name": "VcfHead", 
    "platform": [
        "Linux", 
        "Mac"
    ], 
    "publications": {
        "publicationsPrimaryID": "doi:10.6084/m9.figshare.1425030.v1"
    }, 
    "resourceType": [
        "Tool"
    ], 
    "topic": [
        {
            "term": "Omics", 
            "uri": "http://edamontology.org/topic_3391"
        }
    ], 
    "version": "1.0"
}

Registering the tool

Now that we have the token and the JSON descriptor, we can add VcfHead to Elixir using curl:
curl  -H "Content-type: application/json" \
 -H "Authorization: Token 74dedea023dbad8ecda49ac57bb1074acd794f" \
 -X POST \
 -d  @path/to/VcfHead.json \
 "https://bio.tools/api/tool" |\
 python -m json.tool
output:
{
    "accessibility": "Public", 
    "additionDate": "2016-02-24T11:37:17.458Z", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
(...)

VcfHead is now visible in the Elixir Registry at https://bio.tools/tool/univ-nantes.fr/VcfHead/1.0.
http://i.imgur.com/PjEMjX6.jpg

That's it,
Pierre.

29 June 2015

A BLAST to SAM converter.

Some time ago, I received a set of Ion-Torrent mate-reads of poor quality. I wasn't able to align much of it using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no). That's why I asked guyduche, my internship student, to write a C program to convert the output of blastn to SAM. His code is available on github.

The inputs for blast2bam are:
  • the XML output of NCBI blastn (or stdin)
  • the single or pair of fastq file(s)
  • the reference sequence indexed with picard

Example:

fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

Output:
@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2 83  gi|9629357|ref|NC_001802.1| 715 60  180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S =   715 -119    CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN    (),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!    NM:i:13 RG:Z:foo    AS:i:80 XB:f:148.852    XE:Z:4.07e-39
ERR656485.2 163 gi|9629357|ref|NC_001802.1| 715 60  73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S    =   715 119 NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA    !8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))    NM:i:13 RG:Z:foo    AS:i:82 XB:f:152.546    XE:Z:3.15e-40
(...)
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,
Pierre

05 May 2015

Playing with hadoop/mapreduce and htsjdk/VCF : my notebook.

The aim of this test is to get a count of each type of variant/genotypes in a VCF file using Apache Hadoop and the java library for NGS htsjdk. My source code is available at: https://github.com/lindenb/hadoop-sandbox/blob/master/src/main/java/com/github/lindenb/hadoop/Test.java.

First, and this is my main problem, I needed to create a class 'VcfRow' that would contain the whole data about a variant. As I need to keep the information about all the semantics in the VCF header, each record contains the whole VCF header (!). I asked on StackOverflow if there was an elegant way to save the header in the hadoop workflow, but it currently seems that there is no such solution (http://stackoverflow.com/questions/30052859/hadoop-mapreduce-handling-a-text-file-with-a-header). This class VcfRow must implement WritableComparable to be serialized by the hadoop pipeline. It's awfully sloooooow since we need to re-parse the htsjdk.variant.vcf.VCFHeader with a htsjdk.variant.vcf.VCFCodec for each new variant.

public static class VcfRow
implements WritableComparable<VcfRow>
 {
 private List<String> headerLines;
 private String line;
 private VariantContext ctx=null;
 private VCFHeader header =null;
 private VCFCodec codec=new VCFCodec();
 public VcfRow()
   {
 this.headerLines = Collections.emptyList();
 this.line="";
   }
 public VcfRow(List<String> headerLines,String line)
  {
 this.headerLines=headerLines; 
 this.line=line;
  }
 
@Override
public void write(DataOutput out) throws IOException
 {
 out.writeInt(this.headerLines.size());
 for(int i=0;i< this.headerLines.size();++i)
  {
  out.writeUTF(this.headerLines.get(i));
  }
 byte array[]=line.getBytes();
 out.writeInt(array.length);
 out.write(array);
 }

@Override
public void readFields(DataInput in) throws IOException
 {
 int n= in.readInt();
 this.headerLines=new ArrayList<String>(n);
 for(int i=0;i<n;++i) this.headerLines.add(in.readUTF());
 n = in.readInt();
 byte array[]=new byte[n];
 in.readFully(array);
 this.line=new String(array);
 this.codec=new VCFCodec();
 this.ctx=null;
 this.header=null;
 }

public VCFHeader getHeader()
 {
 if(this.header==null)
  {
  this.header = (VCFHeader)this.codec.readActualHeader(new MyLineIterator());
  }
 return this.header;
 }

public VariantContext getVariantContext()
 {
 if(this.ctx==null)
  {
  if(this.header==null) getHeader();//force decode header
  this.ctx=this.codec.decode(this.line);
  }
 return this.ctx;
 }

@Override
public int compareTo(VcfRow o)
 {
 int i = this.getVariantContext().getContig().compareTo(o.getVariantContext().getContig());
 if(i!=0) return i;
 i = this.getVariantContext().getStart() - o.getVariantContext().getStart();
 if(i!=0) return i;
 i =  this.getVariantContext().getReference().compareTo( o.getVariantContext().getReference());
 if(i!=0) return i;
 return this.line.compareTo(o.line);
 }

   private  class MyLineIterator
 extends AbstractIterator<String>
 implements LineIterator
 { 
 int index=0;
 @Override
 protected String advance()
  {
  if(index>= headerLines.size()) return null;
  return headerLines.get(index++);
  }
 }
}

Then a special InputFormat is created for the VCF format. As we need to keep a trace of the header, this class declares `isSplitable==false`. The class VcfInputFormat creates an instance of RecordReader that reads the whole VCF header the first time it is invoked via the method `initialize`. This 'VcfRecordReader' then creates a new VcfRow for each line.

public static class VcfInputFormat extends FileInputFormat<LongWritable, VcfRow>
   {
 private List<String> headerLines=new ArrayList<String>();
 
 @Override
 public RecordReader<LongWritable, VcfRow> createRecordReader(InputSplit split,
   TaskAttemptContext context) throws IOException,
   InterruptedException {
  return new VcfRecordReader();
  }  
 @Override
 protected boolean isSplitable(JobContext context, Path filename) {
  return false;
  }
  
 //LineRecordReader
  private class VcfRecordReader extends RecordReader<LongWritable, VcfRow>
    {
  private LineRecordReader delegate=new LineRecordReader();
  public VcfRecordReader() throws IOException
    {
    }
  
   @Override
  public void initialize(InputSplit genericSplit,
    TaskAttemptContext context) throws IOException {
    delegate.initialize(genericSplit, context);
   while( delegate.nextKeyValue())
    {
    String row = delegate.getCurrentValue().toString();
    if(!row.startsWith("#")) throw new IOException("Bad VCF header");
    headerLines.add(row);
    if(row.startsWith("#CHROM")) break;
    }
    }
   @Override
  public LongWritable getCurrentKey() throws IOException,
    InterruptedException {
   return delegate.getCurrentKey();
    }
   
   @Override
  public VcfRow getCurrentValue() throws IOException,
    InterruptedException {
   Text row = this.delegate.getCurrentValue();
   return new VcfRow(headerLines,row.toString());
    }
   
   @Override
  public float getProgress() throws IOException, InterruptedException {
   return this.delegate.getProgress();
    }
   
   @Override
  public boolean nextKeyValue() throws IOException,
    InterruptedException {
   return this.delegate.nextKeyValue();
    }
   
   @Override
  public void close() throws IOException {
    delegate.close();
   }
     }
   }

The hadoop mapper uses the information of each VcfRow and produces a count for each category:
public static class VariantMapper
   extends Mapper<LongWritable, VcfRow, Text, IntWritable>{

 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 public void map(LongWritable key, VcfRow vcfRow, Context context ) throws IOException, InterruptedException {
  VariantContext ctx = vcfRow.getVariantContext();
  if( ctx.isIndel())
   { 
   word.set("ctx_indel");
      context.write(word, one);
   }
  if( ctx.isBiallelic())
   { 
   word.set("ctx_biallelic");
      context.write(word, one);
   }
  if( ctx.isSNP())
   { 
   word.set("ctx_snp");
   context.write(word, one);
   } 
  if( ctx.hasID())
   { 
   word.set("ctx_id");
   context.write(word, one);
   } 
  word.set("ctx_total");
  context.write(word, one);
 
  for(String sample: vcfRow.getHeader().getSampleNamesInOrder())
   {
   Genotype g =vcfRow.getVariantContext().getGenotype(sample);
   word.set(sample+" "+ctx.getType()+" "+g.getType().name());
   context.write(word, one);
   }

  }
 }

The Reducer computes the sum of each category:
public static class IntSumReducer
   extends Reducer<Text,IntWritable,Text,IntWritable> {
 private IntWritable result = new IntWritable();

 public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {
   int sum = 0;
   for (IntWritable val : values) {
  sum += val.get();
    }
   result.set(sum);
   context.write(key, result);
 }
}

and here is the main program:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "snp count");
    job.setJarByClass(Test.class);
    job.setMapperClass(VariantMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    Path inputPath=new Path(args[0]);
    job.setInputFormatClass(VcfInputFormat.class);
    FileInputFormat.addInputPath(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

Download, compile, Run:
lindenb@hardyweinberg:~/src/hadoop-sandbox$ make -Bn
rm -rf hadoop-2.7.0
curl -L -o hadoop-2.7.0.tar.gz "http://apache.spinellicreations.com/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz"
tar xvfz hadoop-2.7.0.tar.gz
rm hadoop-2.7.0.tar.gz
touch -c hadoop-2.7.0/bin/hadoop
rm -rf htsjdk-1.130
curl -L -o 1.130.tar.gz "https://github.com/samtools/htsjdk/archive/1.130.tar.gz"
tar xvfz 1.130.tar.gz
rm 1.130.tar.gz
(cd htsjdk-1.130 && ant )
mkdir -p tmp dist
javac -d tmp -cp hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar:hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.0.jar:hadoop-2.7.0/share/hadoop/common/lib/hadoop-annotations-2.7.0.jar:hadoop-2.7.0/share/hadoop/common/lib/log4j-1.2.17.jar:htsjdk-1.130/dist/commons-logging-1.1.1.jar:htsjdk-1.130/dist/htsjdk-1.130.jar:htsjdk-1.130/dist/commons-jexl-2.1.1.jar:htsjdk-1.130/dist/snappy-java-1.0.3-rc3.jar -sourcepath src/main/java src/main/java/com/github/lindenb/hadoop/Test.java 
jar cvf dist/test01.jar -C tmp .
rm -rf tmp
mkdir -p input
curl -o input/CEU.exon.2010_09.genotypes.vcf.gz "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/exon/snps/CEU.exon.2010_09.genotypes.vcf.gz"
gunzip -f input/CEU.exon.2010_09.genotypes.vcf.gz
rm -rf output
HADOOP_CLASSPATH=htsjdk-1.130/dist/commons-logging-1.1.1.jar:htsjdk-1.130/dist/htsjdk-1.130.jar:htsjdk-1.130/dist/commons-jexl-2.1.1.jar:htsjdk-1.130/dist/snappy-java-1.0.3-rc3.jar hadoop-2.7.0/bin/hadoop jar dist/test01.jar com.github.lindenb.hadoop.Test \
   input/CEU.exon.2010_09.genotypes.vcf output
cat output/*

Here is the output of the last command:

15/05/05 17:18:34 INFO input.FileInputFormat: Total input paths to process : 1
15/05/05 17:18:34 INFO mapreduce.JobSubmitter: number of splits:1
15/05/05 17:18:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1186897577_0001
15/05/05 17:18:34 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/05/05 17:18:34 INFO mapreduce.Job: Running job: job_local1186897577_0001
15/05/05 17:18:34 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/05/05 17:18:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/05/05 17:18:34 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/05/05 17:18:34 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/05 17:18:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1186897577_0001_m_000000_0
15/05/05 17:18:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/05/05 17:18:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/05/05 17:18:34 INFO mapred.MapTask: Processing split: file:/home/lindenb/src/hadoop-sandbox/input/CEU.exon.2010_09.genotypes.vcf:0+2530564
15/05/05 17:18:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/05/05 17:18:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/05/05 17:18:34 INFO mapred.MapTask: soft limit at 83886080
15/05/05 17:18:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/05/05 17:18:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/05/05 17:18:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/05/05 17:18:35 INFO mapreduce.Job: Job job_local1186897577_0001 running in uber mode : false
15/05/05 17:18:35 INFO mapreduce.Job:  map 0% reduce 0%
15/05/05 17:18:36 INFO mapred.LocalJobRunner: 
15/05/05 17:18:36 INFO mapred.MapTask: Starting flush of map output
15/05/05 17:18:36 INFO mapred.MapTask: Spilling map output
15/05/05 17:18:36 INFO mapred.MapTask: bufstart = 0; bufend = 7563699; bufvoid = 104857600
15/05/05 17:18:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24902536(99610144); length = 1311861/6553600
15/05/05 17:18:38 INFO mapred.MapTask: Finished spill 0
15/05/05 17:18:38 INFO mapred.Task: Task:attempt_local1186897577_0001_m_000000_0 is done. And is in the process of committing
(...)
NA12843 SNP HOM_REF 2515
NA12843 SNP HOM_VAR 242
NA12843 SNP NO_CALL 293
NA12872 SNP HET 394
NA12872 SNP HOM_REF 2282
NA12872 SNP HOM_VAR 188
NA12872 SNP NO_CALL 625
NA12873 SNP HET 336
NA12873 SNP HOM_REF 2253
NA12873 SNP HOM_VAR 184
NA12873 SNP NO_CALL 716
NA12874 SNP HET 357
NA12874 SNP HOM_REF 2395
NA12874 SNP HOM_VAR 229
NA12874 SNP NO_CALL 508
NA12878 SNP HET 557
NA12878 SNP HOM_REF 2631
NA12878 SNP HOM_VAR 285
NA12878 SNP NO_CALL 16
NA12889 SNP HET 287
NA12889 SNP HOM_REF 2110
NA12889 SNP HOM_VAR 112
NA12889 SNP NO_CALL 980
NA12890 SNP HET 596
NA12890 SNP HOM_REF 2587
NA12890 SNP HOM_VAR 251
NA12890 SNP NO_CALL 55
NA12891 SNP HET 609
NA12891 SNP HOM_REF 2591
NA12891 SNP HOM_VAR 251
NA12891 SNP NO_CALL 38
NA12892 SNP HET 585
NA12892 SNP HOM_REF 2609
NA12892 SNP HOM_VAR 236
NA12892 SNP NO_CALL 59
ctx_biallelic 3489
ctx_id 3489
ctx_snp 3489
ctx_total 3489


that's it,
Pierre

28 February 2015

Integrating a java program in #usegalaxy.

This is my notebook for the integration of java programs in https://usegalaxy.org/ .


create a directory for your tools under ${galaxy-root}/tools


mkdir ${galaxy-root}/tools/jvarkit

put all the required jar files and the XML files describing your tools (see below) in this new directory:

$ ls ${galaxy-root}/tools/jvarkit/
commons-jexl-2.1.1.jar
groupbygene.jar
htsjdk-1.128.jar
vcffilterjs.jar
vcffilterso.jar
vcfhead.jar
vcftail.jar
vcftrio.jar
commons-logging-1.1.1.jar
groupbygene.xml
snappy-java-1.0.3-rc3.jar
vcffilterjs.xml
vcffilterso.xml
vcfhead.xml
vcftail.xml
vcftrio.xml

Each tool is described with an XML file whose schema is available at: https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax . For example, here is a simple file describing the tool VcfHead, which prints the very first variants of a VCF file:


<?xml version="1.0"?>
<tool id="com.github.lindenb.jvarkit.tools.misc.VcfHead" version="1.0.0" name="vcfhead">
  <description>Print first variants of a VCF</description>
  <requirements>
    <requirement type="binary">java</requirement>
  </requirements>
  <command>(gunzip -c ${input} || cat ${input}) | java -cp $__tool_directory__/commons-jexl-2.1.1.jar:$__tool_directory__/commons-logging-1.1.1.jar:$__tool_directory__/htsjdk-1.128.jar:$__tool_directory__/snappy-java-1.0.3-rc3.jar:$__tool_directory__/vcfhead.jar com.github.lindenb.jvarkit.tools.misc.VcfHead -n '${num}' -o ${output}.vcf.gz &amp;&amp; mv ${output}.vcf.gz ${output}</command>
  <inputs>
    <param format="vcf" name="input" type="data" label="VCF input"/>
    <param name="num" type="integer" label="Number of variants" min="0" value="10"/>
  </inputs>
  <outputs>
    <data format="vcf" name="output"/>
  </outputs>
  <stdio>
    <exit_code range="1:"/>
    <exit_code range=":-1"/>
  </stdio>
  <help/>
</tool>

The input file is described by

<param format="vcf" name="input" type="data" label="VCF input"/>

The number of variants is declared in:

<param name="num" type="integer" label="Number of variants" min="0" value="10"/>

Those two variables will be replaced in the command line at runtime by galaxy.
The command line is

(gunzip -c ${input} || cat ${input}) | \
java -cp $__tool_directory__/commons-jexl-2.1.1.jar:$__tool_directory__/commons-logging-1.1.1.jar:$__tool_directory__/htsjdk-1.128.jar:$__tool_directory__/snappy-java-1.0.3-rc3.jar:$__tool_directory__/vcfhead.jar \
 com.github.lindenb.jvarkit.tools.misc.VcfHead \
 -n '${num}' -o ${output}.vcf.gz  &amp;&amp; \
mv ${output}.vcf.gz ${output}

It starts with (gunzip -c ${input} || cat ${input}) because we don't know whether the input will be gzipped.
The main problem here is to set the CLASSPATH and tell java where to find the jar libraries. With the help of @pjacock and @jmchilton, I learned that recent releases of galaxy define a variable $__tool_directory__ giving the location of your tool directory, so you'll just have to prefix each jar file with this variable:

$__tool_directory__/commons-jexl-2.1.1.jar:$__tool_directory__/commons-logging-1.1.1.jar:....

You'll need to declare the new tools in ${galaxy-root}/config/tool_conf.xml

(...)
</section>
<section id="jvk" name="JVARKIT">
<tool file="jvarkit/vcffilterjs.xml"/>
<tool file="jvarkit/vcfhead.xml"/>
<tool file="jvarkit/vcftail.xml"/>
<tool file="jvarkit/vcffilterso.xml"/>
<tool file="jvarkit/vcftrio.xml"/>
<tool file="jvarkit/vcfgroupbygene.xml"/>
</section>
</toolbox>

Your tools are now available in the 'tools' menu

[screenshot: the tools menu in galaxy's left panel]

Clicking on a link will make galaxy display a form for your tool:

[screenshot: the generated tool form]

As far as I can see for now, making a tar archive of your tool directory and uploading it to the galaxy toolshed ( https://toolshed.g2.bx.psu.edu/repository ) will make your tools available to the scientific community.

[screenshot: the galaxy toolshed]



That's it,
Pierre

04 December 2014

XML+XSLT = #Makefile -based #workflows for #bioinformatics

I've recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I've suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of the data, and that XSLT could transform this model into a Makefile-based workflow. I've already explored this idea in a previous post (Generating a pipeline of analysis (Makefile) for NGS with xslt) and in our lab, we use JSON and jsvelocity to generate our workflows (e.g: https://gist.github.com/lindenb/3c07ca722f793cc5dd60).
In the current post, I'll describe my github repository containing a complete XML+XSLT example: https://github.com/lindenb/ngsxml.

Download the data

git clone "https://github.com/lindenb/ngsxml.git"


The model
The XML model is self-explanatory:
<?xml version="1.0" encoding="UTF-8"?>
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>


Validating the model:
This XML document can be validated against an XML schema:
$ xmllint  --schema xsd/schema01.xsd --noout test/model01.xml
test/model01.xml validates


Generating the Makefile:
The XML document is transformed into a Makefile using the following XSLT stylesheet: https://github.com/lindenb/ngsxml/blob/master/stylesheets/model2make.xsl
xsltproc --output Makefile stylesheets/model2make.xsl test/model01.xml
The Makefile:
# Name
#  myProject
# Description:
#  my project
# 
include config.mk
OUTDIR=OUT
BINDIR=$(abspath ${OUTDIR})/bin


# if tools are undefined
bwa.exe ?=${BINDIR}/bwa-0.7.10/bwa
samtools.exe ?=${BINDIR}/samtools-0.1.19/samtools
bcftools.exe ?=${BINDIR}/samtools-0.1.19/bcftools/bcftools
tabix.exe ?=${BINDIR}/tabix-0.2.6/tabix
bgzip.exe ?=${BINDIR}/tabix-0.2.6/bgzip

.PHONY=all clean all_bams all_vcfs

all: all_vcfs


all_vcfs:  \
 $(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz


all_bams:  \
 $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam \
 $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam

#
# VCF for project 'Proj1'
# 
$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz : $(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam) \
 $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bgzip.exe} ${tabix.exe} ${bcftools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} mpileup -uf ${REFERENCE} $(basename $(filter %.bai,$^)) | \
 ${bcftools.exe} view -vcg - > $(basename $@)  && \
 ${bgzip.exe} -f $(basename $@) && \
 ${tabix.exe} -f -p vcf $@
 
    



#
# index final BAM for Sample 'Sample1'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample1'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@



#
# merge BAMs 
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam :  \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
  ${samtools.exe} merge -f $@ $(filter %.bam,$^)
  



 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_01_R1.fastq.gz and test/fastq/sample_1_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_1_01_R1.fastq.gz  \
 test/fastq/sample_1_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20678948\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_02_R1.fastq.gz and test/fastq/sample_1_02_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam : \
 test/fastq/sample_1_02_R1.fastq.gz  \
 test/fastq/sample_1_02_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  ${REFERENCE} \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




#
# index final BAM for Sample 'Sample2'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample2'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_2_01_R1.fastq.gz and test/fastq/sample_2_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_2_01_R1.fastq.gz  \
 test/fastq/sample_2_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20681172\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




$(addsuffix .fai,${REFERENCE}): ${REFERENCE} ${samtools.exe}
 ${samtools.exe} faidx $<

$(addsuffix .bwt,${REFERENCE}): ${REFERENCE} ${bwa.exe}
 ${bwa.exe} index $<


${BINDIR}/bwa-0.7.10/bwa :
 rm -rf $(BINDIR)/bwa-0.7.10/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/bwa-0.7.10.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/bwa-0.7.10.tar.bz2 && \
 make -C $(dir $@)

${BINDIR}/samtools-0.1.19/bcftools/bcftools: ${BINDIR}/samtools-0.1.19/samtools

${BINDIR}/samtools-0.1.19/samtools  : 
 rm -rf $(BINDIR)/samtools-0.1.19/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/samtools-0.1.19.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/samtools-0.1.19.tar.bz2 && \
 make -C $(dir $@)


${BINDIR}/tabix-0.2.6/bgzip : ${BINDIR}/tabix-0.2.6/tabix


${BINDIR}/tabix-0.2.6/tabix  : 
 rm -rf $(BINDIR)/tabix-0.2.6/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/tabix-0.2.6.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/tabix-0.2.6.tar.bz2 && \
 make -C $(dir $@) tabix bgzip

clean:
 rm -rf ${BINDIR}

Drawing the Workflow:
The workflow is drawn with https://github.com/lindenb/makefile2graph.


Running Make:
And here is the output of make:
rm -rf /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/
bwa-0.7.10/
bwa-0.7.10/bamlite.c
(...)
gcc -g -Wall -Wno-unused-function -O2 -msse -msse2 -msse3 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o is.o bwtindex.o bwape.o kopen.o pemerge.o bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o bwtsw2_chain.o fastmap.o bwtsw2_pair.o main.o -o bwa -L. -lbwa -lm -lz -lpthread
make[1]: Entering directory `/home/lindenb/src/ngsxml'
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa index test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/
samtools-0.1.19/
samtools-0.1.19/.gitignore
samtools-0.1.19/AUTHORS
(...)
gcc -g -Wall -O2 -o bamcheck bamcheck.o -L.. -lm -lbam -lpthread -lz
make[3]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/misc'
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19'
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11671724\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  test/ref/ref.fa \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
  /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools merge -f OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11673828\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted.bam  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools faidx test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ tabix bgzip
tabix-0.2.6/
tabix-0.2.6/ChangeLog
tabix-0.2.6/Makefile
(...)
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6'
mkdir -p OUT/Projects/Proj1/VCF/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools mpileup -uf test/ref/ref.fa OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam | \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/bcftools/bcftools view -vcg - > OUT/Projects/Proj1/VCF/Proj1.vcf  && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/bgzip -f OUT/Projects/Proj1/VCF/Proj1.vcf && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/tabix -f -p vcf OUT/Projects/Proj1/VCF/Proj1.vcf.gz
make[1]: Leaving directory `/home/lindenb/src/ngsxml'
At the end, a VCF is generated:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr4_gl000194_random 1973 . C G 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,3,0;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2462 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2492 . G C 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2504 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
That's it,

Pierre

30 September 2014

Using the Ensembl Regulatory Build to annotate some VCF files

via UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states".
For example, looking at chr21:33037019-33037021 returns the following screen:

Those new annotations are deployed by the Sanger Institute as a UCSC track hub. By the way, those files can be queried directly with the UCSC standalone tools:
$ bigWigSummary -type=mean -udcDir=.  \
  "http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/segmentation_summaries/Segway_17/1.bw" \
  chr1 1  110301 1

1.23587
I wrote a Java tool to annotate VCFs with those files. It uses the BigWig library for Java (https://code.google.com/p/bigwig/) and is available at https://github.com/lindenb/jvarkit/wiki/VcfEnsemblReg.
Here is an example with the following VCF:
##fileformat=VCFv4.1
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . . GT:PL:DP:GQ 1/1:35,3,0:1:4
VcfEnsemblReg is invoked:
$  java -jar dist/vcfensemblreg.jar in.vcf > out.vcf
Here is the content of out.vcf:
##fileformat=VCFv4.1
##INFO=<ID=AP2ALPHA,Number=1,Type=Float,Description="Overlap summary of AP2ALPHA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2ALPHA.bw">
##INFO=<ID=AP2GAMMA,Number=1,Type=Float,Description="Overlap summary of AP2GAMMA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2GAMMA.bw">
##INFO=<ID=ATF3,Number=1,Type=Float,Description="Overlap summary of ATF3 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/ATF3.bw">
##INFO=<ID=BAF155,Number=1,Type=Float,Description="Overlap summary of BAF155 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF155.bw">
##INFO=<ID=BAF170,Number=1,Type=Float,Description="Overlap summary of BAF170 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF170.bw">
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr21 33037029 . C T 6.20 . BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;Segway_17_14=7.0;Segway_17_24=3.0;Segway_17_6=1.0;Segway_17_7=2.0;Segway_17_8=1.0;Segway_17_A549_projected=ctcf_45704|InactiveRegions;Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;Segway_17_DND41_projected=ctcf_45704|InactiveRegions;Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;Segway_17_GM12878_projected=ctcf_45704|InactiveRegions;Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite;Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF;Segway_17_HELAS3_projected=ctcf_45704|InactiveRegions;Segway_17_HELAS3_segments=6_distal_76536|DistalEnhancer;Segway_17_HEPG2_projected=ctcf_45704|InactiveRegions;Segway_17_HEPG2_segments=1_distal_21535|DistalEnhancer;Segway_17_HMEC_projected=ctcf_45704|InactiveRegions;Segway_17_HMEC_segments=14_gene_44998|TranscriptionAssociated;Segway_17_HSMMT_projected=ctcf_45704|InactiveRegions;Segway_17_HSMMT_segments=24_gene_70780|TranscriptionAssociated;Segway_17_HSMM_projected=ctcf_45704|InactiveRegions;Segway_17_HSMM_segments=24_gene_80902|TranscriptionAssociated;Segway_17_HUVEC_projected=ctcf_45704|InactiveRegions;Segway_17_K562_projected=ctcf_45704|InactiveRegions;Segway_17_K562_segments=14_gene_68692|TranscriptionAssociated;Segway_17_MONO_projected=ctcf_45704|InactiveRegions;Segway_17_MONO_segments=14_gene_35200|TranscriptionAssociated;Segway_17_NHA_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_segments=14_gene_57366|TranscriptionAssociated;Segway_17_NHEK_projected=ctcf_45704|InactiveRegions;Segway_17_NHEK_segments=24_gene_95458|TranscriptionAssociated;Segway_17_NHLF_projected=ctcf_45704|InactiveRegions;Segway_17_NHLF_segments=14_gene_59524|TranscriptionAssociated;Segway_17_OSTEO_projected=ctcf_45704|InactiveRegions;Segway_17_OSTEO_segments=14_gene_61575|TranscriptionAssociated GT:PL:DP:GQ 1/1:35,3,0:1:4
Here are the new fields in the INFO column:
Segway_17_1 3.0
Segway_17_14 7.0
Segway_17_24 3.0
Segway_17_6 1.0
Segway_17_7 2.0
Segway_17_8 1.0
Segway_17_A549_projected ctcf_45704|InactiveRegions
Segway_17_A549_segments 14_gene_79558|TranscriptionAssociated
Segway_17_DND41_projected ctcf_45704|InactiveRegions
Segway_17_DND41_segments 1_distal_17115|DistalEnhancer
Segway_17_GM12878_projected ctcf_45704|InactiveRegions
Segway_17_GM12878_segments 1_distal_29075|DistalEnhancer
Segway_17_H1HESC_projected ctcf_45704|ActiveCTCFBindingSite
Segway_17_H1HESC_segments 8_ctcf_27831|DistalCTF
Segway_17_HELAS3_projected ctcf_45704|InactiveRegions
Segway_17_HELAS3_segments 6_distal_76536|DistalEnhancer
Segway_17_HEPG2_projected ctcf_45704|InactiveRegions
Segway_17_HEPG2_segments 1_distal_21535|DistalEnhancer
Segway_17_HMEC_projected ctcf_45704|InactiveRegions
Segway_17_HMEC_segments 14_gene_44998|TranscriptionAssociated
Segway_17_HSMMT_projected ctcf_45704|InactiveRegions
Segway_17_HSMMT_segments 24_gene_70780|TranscriptionAssociated
Segway_17_HSMM_projected ctcf_45704|InactiveRegions
Segway_17_HSMM_segments 24_gene_80902|TranscriptionAssociated
Segway_17_HUVEC_projected ctcf_45704|InactiveRegions
Segway_17_K562_projected ctcf_45704|InactiveRegions
Segway_17_K562_segments 14_gene_68692|TranscriptionAssociated
Segway_17_MONO_projected ctcf_45704|InactiveRegions
Segway_17_MONO_segments 14_gene_35200|TranscriptionAssociated
Segway_17_NHA_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_segments 14_gene_57366|TranscriptionAssociated
Segway_17_NHEK_projected ctcf_45704|InactiveRegions
Segway_17_NHEK_segments 24_gene_95458|TranscriptionAssociated
Segway_17_NHLF_projected ctcf_45704|InactiveRegions
Segway_17_NHLF_segments 14_gene_59524|TranscriptionAssociated
Segway_17_OSTEO_projected ctcf_45704|InactiveRegions
Segway_17_OSTEO_segments 14_gene_61575|TranscriptionAssociated

OK, now I've got a VCF containing those 'Ensembl Regulatory' annotations. What can I do with this? I currently have no idea :-)

That's it,
Pierre

08 August 2014

A GNU-make plug-in for the #Illumina FASTQs.

The latest version of GNU Make (http://www.gnu.org/software/make/) provides many advanced capabilities, including many useful functions. However, it does not contain a complete programming language, so it has limitations. Sometimes these limitations can be overcome by using the shell function to invoke a separate program, although this can be inefficient. On systems that support dynamically loadable objects, you can write your own extension in any language (which can be compiled into such an object) and load it to provide extended capabilities (see http://www.gnu.org/software/make/manual/make.html#Loading-Objects).
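
As a warm-up, here is a minimal, self-contained loadable object; the greet function is purely illustrative and is not part of the Illumina plug-in below. Note the plugin_is_GPL_compatible symbol, which make requires but which the snippets later in this post leave out:

/* greet.c : compile with
 *   gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o greet.so greet.c
 * then, in a Makefile, load it and call it:
 *   load ./greet.so
 *   $(info $(greet world))
 */
#include <string.h>
#include <gnumake.h>

/* make refuses to load objects that do not carry this GPL-compatibility marker */
int plugin_is_GPL_compatible;

/* $(greet X) expands to "hello X" */
static char* greet(const char *name, unsigned int argc, char **argv)
    {
    /* memory handed back to make must be allocated with gmk_alloc() */
    char* buf = gmk_alloc(strlen("hello ") + strlen(argv[0]) + 1);
    strcpy(buf, "hello ");
    strcat(buf, argv[0]);
    return buf;
    }

/* make calls <object-basename>_gmk_setup right after loading greet.so */
int greet_gmk_setup()
    {
    gmk_add_function("greet", greet, 1, 1, 0);
    return 1; /* non-zero means the setup succeeded */
    }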

Building a plug-in for the Illumina FASTQs.

from http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz

Here I'm writing a set of new make functions to extract the different parts (sample, lane, ...) of a FASTQ file name:

The code is available on github.com at

First, an enum and a struct holding the parts of the file name are defined:

enum E_IlluminaComponent
    {
    E_sampleName,
    E_barcodeSequence,
    E_lane,
    E_readNumber,
    E_setNumber,
    NUM_ILLUMINA_COMPONENTS /* count of components, sizes the array below */
    };

typedef struct illumina_scheme_t
    {
    char* filename;
    char* components[NUM_ILLUMINA_COMPONENTS];
    } IlluminaScheme,*IlluminaSchemePtr ;

and a function parsing the file names is declared:

IlluminaSchemePtr IlluminaSchemeNew(const char* filename)
    {
    ...
    }
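
As a sketch only (this is my reconstruction, not the code from the repository), the elided body essentially splits the name on '_':

#include <stdlib.h>
#include <string.h>

IlluminaSchemePtr IlluminaSchemeNew(const char* filename)
    {
    IlluminaSchemePtr scheme = (IlluminaSchemePtr)calloc(1, sizeof(IlluminaScheme));
    char* copy;
    char* token;
    int i = 0;
    if (scheme == NULL) return NULL;
    scheme->filename = strdup(filename);
    copy = strdup(filename);
    /* split "NA10831_ATCACG_L002_R1_001.fastq.gz" on '_' */
    for (token = strtok(copy, "_");
         token != NULL && i < NUM_ILLUMINA_COMPONENTS;
         token = strtok(NULL, "_"), ++i)
        {
        scheme->components[i] = strdup(token);
        }
    free(copy);
    /* a real implementation would also strip the trailing ".fastq.gz" from
       the set number and validate each component (lane is L + 3 digits,
       read is R1/R2...), returning NULL when the name doesn't match */
    return scheme;
    }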

When the plugin illumina is loaded as a dynamic C library, make calls the method illumina_gmk_setup,
in which we register the new functions with gmk_add_function(name, callback, min_args, max_args, flags) (the last argument controls, among other things, whether the arguments are expanded before the call):

int illumina_gmk_setup ()
  {
   gmk_add_function ("illumina_sample",illumina_sample, 1, 1, 0);
   gmk_add_function ("illumina_lane",illumina_lane, 1, 1, 0);
   (...)
   return 1; /* non-zero tells make the setup succeeded */
  }

A function registered with make must match the gmk_func_ptr type.
It will be invoked with three parameters: name (the name of the function), argc (the number of arguments to the function), and argv (an array of pointers to arguments to the function). The last pointer (that is, argv[argc]) will be null (0).
The return value of the function is the result of expanding the function.

char* illumina_sample(const char *function_name, unsigned int argc, char **argv)
    {
    /** extract the filename(s), build and return a string containing the samples */
    }
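
Here is a minimal sketch of that callback (again my reconstruction under the declarations above, not the repository's code; the real plug-in also de-duplicates the names, as the output below shows):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <gnumake.h>

char* illumina_sample(const char *function_name, unsigned int argc, char **argv)
    {
    /* argv[0] is the expanded argument: a whitespace-separated list of file names */
    char buffer[BUFSIZ]; /* fixed-size buffer keeps the sketch short */
    char* copy = strdup(argv[0]);
    char* token;
    char* result;
    buffer[0] = '\0';
    for (token = strtok(copy, " \t\n"); token != NULL; token = strtok(NULL, " \t\n"))
        {
        IlluminaSchemePtr scheme = IlluminaSchemeNew(token);
        if (scheme == NULL) continue; /* not an Illumina file name: skip it */
        if (buffer[0] != '\0') strcat(buffer, " ");
        strcat(buffer, scheme->components[E_sampleName]);
        /* freeing the scheme is omitted for brevity */
        }
    free(copy);
    /* strings returned to make must be allocated with gmk_alloc() */
    result = gmk_alloc(strlen(buffer) + 1);
    strcpy(result, buffer);
    return result;
    }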

Compiling

The plug-in must be compiled as a dynamic C library.

Note: the manual says that make can even build a missing loaded object from a rule inside the final 'Makefile' itself (via load ./illumina.so), but I was not able to get this to work (illumina.so: cannot open shared object file: No such file or directory)

so I compiled it by hand:

gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o illumina.so illumina.c

Test

here is the test Makefile (the plug-in is loaded first with the load directive, assuming illumina.so sits in the current directory):

load ./illumina.so

SAMPLES=  NA10831_ATCACG_L002_R1_001.fastq.gz \
      hello \
      NA10832_ATCACG_L002_R1_001.fastq.gz \
      NA10831_ATCACG_L002_R2_001.fastq.gz \
      NA10832_ATCACG_L002_R2_001.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      ERROR_ATCAGG_x003_R3_0z3.fastq.gz \
      false

all:
    @echo "SAMPLES: " $(illumina_sample  ${SAMPLES} )
    @echo "BARCODES: " $(illumina_barcode  ${SAMPLES} )
    @echo "LANE: " $(illumina_lane  ${SAMPLES} )
    @echo "READ: " $(illumina_read  ${SAMPLES} )
    @echo "SET: " $(illumina_set  ${SAMPLES} )

output:

$ make
SAMPLES:  NA10831 NA10832 NA10833
BARCODES:  ATCACG ATCAGG
LANE:  L002 L003
READ:  R1 R2
SET:  001 003

That's it,

Pierre