04 March 2016

Now in picard: two javascript-based tools filtering BAM and VCF files.


SamJS and VCFFilterJS are two tools I wrote for jvarkit. Both tools use the embedded java javascript engine to filter BAM or VCF files.
To reach a broader audience, I've copied those functionalities to Picard in 'FilterSamReads' and 'FilterVcf'.

FilterSamReads

FilterSamReads filters a SAM or BAM file with a javascript expression using the java javascript-engine.
The tool puts the following variables in the script context: 'record', a SAMRecord, and 'header', a SAMFileHeader. The last value of the script should be a boolean telling whether to accept or reject the record.

The script samFilter.js

/** accept the record if the second base of the read is an A */
function accept(r)
 {
 return r.getReadString().length()>=2 &&
  r.getReadString().substring(1,2)=="A";
 }

accept(record);
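
As a cross-check of the boundary cases, the same test can be restated in plain Java on bare read strings. This is only an illustrative sketch: the class name SecondBaseFilter is made up for this note, and no Picard or htsjdk class is involved.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java restatement of the samFilter.js test (hypothetical class, no htsjdk). */
final class SecondBaseFilter {
    /** True when the read has at least two bases and the second one is 'A'. */
    static boolean accept(String bases) {
        return bases != null && bases.length() >= 2 && bases.charAt(1) == 'A';
    }

    public static void main(String[] args) {
        List<String> reads = List.of("CATG", "GA", "A", "TTTT");
        List<String> kept = reads.stream()
                .filter(SecondBaseFilter::accept)
                .collect(Collectors.toList());
        System.out.println(kept); // prints [CATG, GA]
    }
}
```

A read of exactly two bases such as "GA" is kept, while a single-base read is rejected without any index error.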

Invoke and output

$ java -jar picard-tools-2.1.1/picard.jar \
 FilterSamReads I=in.sam  O=out.sam \
 JS=samFilter.js FILTER=includeJavascript
 
$ cat out.sam | cut -f 10 | cut -c 2 | sort | uniq

A

FilterVcf

FilterVcf applies one or more hard filters to a VCF file to filter out genotypes and variants.
It filters a VCF file with a javascript expression interpreted by the java javascript engine. The tool puts the following variables in the script context: 'variant', a VariantContext, and 'header', a VCFHeader. The last value of the script should be a boolean telling whether to accept or reject the record.

The script variantFilter.js

/** accepts a VARIATION if more than two samples have a DP>100 */ 
function myfilterFunction(thevariant)
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;

    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! thevariant.hasGenotype(sampleName)) continue;
        var genotype = thevariant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp > 100 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction(variant)

Invoke and output

java -jar picard-tools-2.1.1/picard.jar FilterVcf \
 I=in.vcf O=out.vcf \
 JS=variantFilter.js
 
$ grep -v '#' out.vcf | cut -f 7 | grep variantFilter | wc -l
23


That's it,
Pierre

Reading a VCF file faster with java 8, htsjdk and java.util.stream.Stream

java 8 streams "support functional-style operations on streams of elements, such as map-reduce transformations on collections". In this post, I will show how I've implemented a java.util.stream.Stream of VCF variants that counts the number of items in dbsnp.

This example uses the java htsjdk API for reading variants.

When using parallel streams, the main idea is to implement a java.util.Spliterator that will split the sequence dictionary (the genome) into a maximum of N (here N=5) parts. Each part will count the number of variants in 1/N of the genome in its own thread. As we're using a tribble-indexed VCF, it's easy to start counting at a given position of the genome.

ContigPos

The class ContigPos defines a chromosome and a position in the whole genome.

class ContigPos {
    /** contig/chromosome index in the dictionary */
    final int tid; 
    /** contig/chromosome name */
    final String contig;
    /** position in the chromosome */
    final int pos;
    (...)
    }

It contains a function to convert its position to an index in the whole genome using the genome dictionary (htsjdk.samtools.SAMSequenceDictionary).

long genomicIndex() {
    long n=0L;
    for(int i=0;i< this.tid;++i) {
        n += dict.getSequence(i).getSequenceLength();
        }
    return n + this.pos;
    }
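
The mapping works in both directions: trySplit() later needs to turn a genome-wide index back into a contig and a position. The sketch below shows one way to do both conversions with cumulative offsets. It is not the ContigPos class from this post: GenomeIndex, its methods and the toy contig lengths are assumptions made for illustration, and plain arrays stand in for the SAMSequenceDictionary. Every contig is assumed to have a length greater than zero.

```java
import java.util.Arrays;

/** Toy stand-in for a sequence dictionary (hypothetical class, no htsjdk). */
final class GenomeIndex {
    private final String[] names;
    /** offsets[i] = summed length of all contigs located before contig i */
    private final long[] offsets;
    private final long total;

    GenomeIndex(String[] names, int[] lengths) {
        if (names.length != lengths.length) {
            throw new IllegalArgumentException("names and lengths differ in size");
        }
        this.names = names.clone();
        this.offsets = new long[lengths.length];
        long n = 0L;
        for (int i = 0; i < lengths.length; i++) {
            this.offsets[i] = n;
            n += lengths[i];
        }
        this.total = n;
    }

    /** Same arithmetic as genomicIndex(): lengths of the previous contigs plus the position. */
    long toGenomicIndex(int tid, int pos) {
        return this.offsets[tid] + pos;
    }

    /** Inverse mapping: returns {tid, pos} for a genome-wide index. */
    int[] fromGenomicIndex(long index) {
        if (index < 0 || index > this.total) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        int tid = Arrays.binarySearch(this.offsets, index);
        /* not an exact offset: the contig is the one just before the insertion point */
        if (tid < 0) tid = -tid - 2;
        return new int[] { tid, (int) (index - this.offsets[tid]) };
    }

    String name(int tid) {
        return this.names[tid];
    }

    public static void main(String[] args) {
        GenomeIndex g = new GenomeIndex(
                new String[] { "chr1", "chr2", "chr3" },
                new int[] { 100, 50, 25 });
        long idx = g.toGenomicIndex(1, 10);
        int[] back = g.fromGenomicIndex(idx);
        System.out.println(idx + " -> " + g.name(back[0]) + ":" + back[1]); // prints 110 -> chr2:10
    }
}
```

Precomputing the offsets replaces the loop over the previous contigs with one array lookup, and the binary search gives the inverse in logarithmic time. Note that an index equal to the start of a contig maps to position 0 of that contig rather than to the end of the previous one.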

VariantContextSpliterator

VariantContextSpliterator is the main class. It splits the VCF file into parts and implements Spliterator<VariantContext> .

public class VariantContextSpliterator
    implements Closeable,Spliterator<VariantContext> {
(...)

VariantContextSpliterator contains the sequence dictionary and the path to the indexed VCF file.

/** path to the indexed VCF file */
private final File vcfFile;
/** genome dictionary */
private final SAMSequenceDictionary dict ;

Each VariantContextSpliterator has its own private VCFFileReader and CloseableIterator. Both should be closed when there are no more variants to read.

/** current VCF File reader */
private VCFFileReader vcfFileReader = null;
/** current variant iterator */
private CloseableIterator<VariantContext> variantIter = null;

Each VariantContextSpliterator has a dedicated genomic region.

/* region start */
private ContigPos begin;
/** region end */
private ContigPos end ;

The very first VariantContextSpliterator will scan :

  • from begin = new ContigPos("chr1",0)
  • to end = new ContigPos("chrY",(size_of_chrY))

We don't want to open too many threads, so we track the number of opened iterators in an AtomicInteger.

AtomicInteger nsplits

VariantContextSpliterator.peek()

VariantContextSpliterator.peek() is a method peeking the next Variant in the genomic interval.

We open the VCFFileReader if it was never opened, and the number of opened files is incremented.

/** VCF reader was never opened */
if( this.vcfFileReader == null ) {
    /** open the VCF reader */
    this.vcfFileReader = new VCFFileReader(this.vcfFile, true);
    /** open a first iterator on the first chromosome */
    this.variantIter = this.vcfFileReader.query(
            this.begin.contig,
            this.begin.pos,
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */
            );
    /** increase the number of opened streams */
    this.nsplits.incrementAndGet();
    } 

While no variant remains on the current chromosome, open the next chromosome for reading:

while(!this.variantIter.hasNext()) {
    this.variantIter.close();
    this.variantIter = null;
    if(this.begin.tid == this.end.tid) /* this is the last chromosome */
        {
        close();
        return null;
        }
    else
        {
        this.begin = new ContigPos(this.begin.tid+1, 0);
        this.variantIter = this.vcfFileReader.query(
            this.begin.contig,
            this.begin.pos,
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */
            );
        }
    }

Get the next variant and update 'begin' with this variant. We close the VCFFileReader if we have reached the end of the genomic window.

/* get the next variant */
final VariantContext ctx = this.variantIter.next();
/* update 'begin' */
this.begin= new ContigPos(ctx.getContig(), ctx.getStart());

/** close if the end of the genomic location was reached */
if((this.begin.tid > this.end.tid) ||
   (this.begin.tid == this.end.tid && this.begin.pos >= this.end.pos) ) {
    close();
    return null;
    }
this._peeked = ctx;
return this._peeked;

VariantContextSpliterator.tryAdvance()

If a remaining variant exists, tryAdvance performs the given action on it and returns true; otherwise it returns false.

@Override
public boolean tryAdvance(Consumer<? super VariantContext> action) {
    final VariantContext ctx = this.next();
    if(ctx==null) {
        close();
        return false;
        }
    action.accept(ctx);
    return true;
    }

VariantContextSpliterator.trySplit()

trySplit returns a VariantContextSpliterator covering elements that will, upon return from this method, not be covered by this VariantContextSpliterator. We can split if the remaining window is larger than 1 Mb and if the number of opened VCFFileReader instances does not exceed 5.

public Spliterator<VariantContext> trySplit() {
    final VariantContext ctx = this.peek();
    /** no more variant to read, can't split */
    if(ctx==null) return null;
    /** too many opened VCFFile reader, can't split */
    if( this.nsplits.get()>5) return null;

    long start = this.begin.genomicIndex();
    long distance = this.end.genomicIndex() - start;

    /** distance between begin and end is greater than 1Mb */
    while(distance > 1E6 )
        {
        distance = distance/2;
        /** middle genomic index */
        final ContigPos mid = new ContigPos(start + distance);
        
        /** create a new VariantContextSpliterator starting from mid and ending at this.end */
        final VariantContextSpliterator next = new VariantContextSpliterator(this,mid,this.end);
        if(next.peek()==null) {
            next.close();
            continue;
            }
        /* update this.end to 'mid' */
        this.end= mid;
        //System.err.println("split successful:after split "+toString()+" and next="+next);
        return next;
        }

    return null;
    }
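
The splitting logic can be tried without any VCF file. The sketch below applies the same idea to a plain range of numbers: it halves the remaining window, hands the second half to a new spliterator and caps the number of parts with a shared AtomicInteger. RangeSpliterator, MIN_WINDOW and MAX_PARTS are names invented for this illustration, and the thresholds are arbitrary. Because the returned part is the tail rather than a prefix of the remaining elements, the sketch does not report the ORDERED characteristic, which is fine for counting.

```java
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

/** Spliterator over the half-open range [begin, end) that halves itself on demand. */
final class RangeSpliterator implements Spliterator<Long> {
    /** do not split a window smaller than this (arbitrary value) */
    private static final long MIN_WINDOW = 1_000L;
    /** maximum number of parts ever created, including the first one */
    private static final int MAX_PARTS = 5;

    private long begin;
    private long end;
    /** counter shared by all the parts derived from the same root */
    private final AtomicInteger parts;

    RangeSpliterator(long begin, long end) {
        this(begin, end, new AtomicInteger(1));
    }

    private RangeSpliterator(long begin, long end, AtomicInteger parts) {
        this.begin = begin;
        this.end = end;
        this.parts = parts;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Long> action) {
        if (this.begin >= this.end) return false;
        action.accept(Long.valueOf(this.begin++));
        return true;
    }

    @Override
    public Spliterator<Long> trySplit() {
        final long distance = this.end - this.begin;
        /* window too small, can't split */
        if (distance <= MIN_WINDOW) return null;
        /* reserve a slot, give it back if the cap is exceeded */
        if (this.parts.incrementAndGet() > MAX_PARTS) {
            this.parts.decrementAndGet();
            return null;
        }
        final long mid = this.begin + distance / 2;
        /* the new part covers [mid, end), this part keeps [begin, mid) */
        final RangeSpliterator tail = new RangeSpliterator(mid, this.end, this.parts);
        this.end = mid;
        return tail;
    }

    @Override
    public long estimateSize() {
        return this.end - this.begin;
    }

    @Override
    public int characteristics() {
        return NONNULL | IMMUTABLE;
    }

    public static void main(String[] args) {
        long n = StreamSupport.stream(new RangeSpliterator(0L, 1_000_000L), true).count();
        System.out.println(n); // prints 1000000
    }
}
```

Each element belongs to exactly one part, so the parallel count equals the sequential one whatever the number of splits the fork/join framework asks for.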

Testing

To get a stream, we call the static function java.util.stream.StreamSupport.stream.

stream() Creates a new sequential or parallel Stream from a Spliterator. The spliterator is only traversed, split, or queried for estimated size after the terminal operation of the stream pipeline commences.

private Stream<VariantContext> stream(boolean parallel) {
    return StreamSupport.stream(new VariantContextSpliterator(this.vcfFile), parallel);
    }

We count the number of variants in dbSNP. We print the duration for stream(), parallelStream() and a standard iterator.

final File vcFile =new File(args[0]);
StreameableVcfFile test= new StreameableVcfFile(vcFile);
long start1 = System.currentTimeMillis();
System.out.println("count;"+test.parallelStream().count());
long end1 = System.currentTimeMillis();
System.out.println(" parallelstream: " + ((end1 - start1) / 1000));



long start2 = System.currentTimeMillis();
System.out.println("count;"+test.stream().count());
long end2 = System.currentTimeMillis();
System.out.println("stream : " + ((end2 - start2) / 1000));


long start3 = System.currentTimeMillis();
CloseableIterator<VariantContext>  r= new VCFFileReader(vcFile).iterator();
int n=0;
while(r.hasNext()) { r.next(); ++n;}
r.close();
long end3 = System.currentTimeMillis();
 System.out.println("count;"+n);
System.out.println("iter : " + ((end3 - start3) / 1000));

Output:

count: 61045456 snps
parallelstream: 80 seconds

count: 61045456 snps
stream : 365 seconds

count: 61045456 snps
iter : 355 seconds

That's it,

Pierre

Source code

24 February 2016

Registering a tool in the @ELIXIREurope registry using XML, XSLT, JSON and curl. My notebook.

The Elixir Registry / pmid:26538599 "A portal to bioinformatics resources world-wide. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools."
In this post, I will describe how I've used the bio.tools API to register some tools from jvarkit.

Authenticate with your credentials

Using curl, the 'bio.tools' service returns an authentication token.

$ curl -s \
 -H "Content-type: application/json" \
 -X POST \
 -d '{"username":"my-login@univ-nantes.fr","password":"password1234"}' \
 https://bio.tools/api/auth/login |\
 python -m json.tool
{
    "token": "74dedea023dbad8ecda49ac57bb1074acd794f"
}

Creating a JSON describing the tool.

The tool I'm going to use is VcfHead, a very simple tool printing the first variants of a VCF file. In jvarkit I don't write the code parsing the arguments: everything is described in an XML file that is processed with an XSLT stylesheet to generate abstract Java code handling the options, etc.

xsltproc command2java VcfHead.xml > AbstractVcfHead.java

For VcfHead the XML descriptor is available here: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfHead.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app xmlns="http://github.com/lindenb/jvarkit/" xmlns:h="http://www.w3.org/1999/xhtml" app="VcfHead" package="com.github.lindenb.jvarkit.tools.misc" >
  <description>Print the first variants of a VCF.</description>
  <input type="vcf"/>
  <output type="vcf"/>
  <options>
    <option name="count" type="int" opt="n" longopt="count" min-inclusive="0" default="10">
      <description>number of variants to be printed</description>
    </option>
  </options>
  <documentation>
    <h:h3>Example</h:h3>
   (...)
  </documentation>
</app>
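
To give an idea of how an XSLT stylesheet can turn such a descriptor into Java source, here is a minimal sketch using the XSLT 1.0 processor shipped with the JDK (javax.xml.transform). The descriptor is a simplified, namespace-free toy and the stylesheet is hypothetical: it is not the jvarkit stylesheet, it only emits one field per option.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Applies a toy stylesheet to a toy descriptor (both are assumptions, not jvarkit files). */
final class DescriptorToJava {
    /** simplified descriptor, loosely shaped like VcfHead.xml but without namespaces */
    static final String DESCRIPTOR =
          "<app app='VcfHead'>"
        + "<options><option name='count' type='int' default='10'/></options>"
        + "</app>";

    /** hypothetical stylesheet: one protected field per option */
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/app'>"
        + "<xsl:text>abstract class Abstract</xsl:text><xsl:value-of select='@app'/>"
        + "<xsl:text> {&#10;</xsl:text>"
        + "<xsl:apply-templates select='options/option'/>"
        + "<xsl:text>}&#10;</xsl:text>"
        + "</xsl:template>"
        + "<xsl:template match='option'>"
        + "<xsl:text>    protected </xsl:text><xsl:value-of select='@type'/>"
        + "<xsl:text> </xsl:text><xsl:value-of select='@name'/>"
        + "<xsl:text> = </xsl:text><xsl:value-of select='@default'/>"
        + "<xsl:text>;&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet on the XML and returns the generated text. */
    static String transform(String xml, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException err) {
            throw new IllegalStateException("XSLT transformation failed", err);
        }
    }

    public static void main(String[] args) {
        /* prints an abstract class with one 'count' field initialized to 10 */
        System.out.print(transform(DESCRIPTOR, STYLESHEET));
    }
}
```

The real descriptor declares a default namespace, so a stylesheet matching it would have to bind that namespace to a prefix and use it in the match patterns.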

Using a first XSLT stylesheet https://github.com/lindenb/jvarkit/blob/master/src/main/resources/xsl/jsonxelixir.xsl, 'VcfHead.xml' is first converted to the 'infamous' JSONx (JSON+XML) format.
xsltproc jsonxelixir VcfHead.xml > VcfHead.jsonx
The JSONx file:
<?xml version="1.0"?>
<jsonx:object xmlns:jsonx="http://www.ibm.com/xmlns/prod/2009/jsonx" xmlns:c="http://github.com/lindenb/jvarkit/" xmlns="http://www.w3.org/1999/xhtml" xmlns:x="http://www.ibm.com/xmlns/prod/2009/jsonx">
  <jsonx:string name="accessibility">Public</jsonx:string>
  <jsonx:string name="affiliation">univ-nantes.fr</jsonx:string>
  <jsonx:string name="cost">Free</jsonx:string>
  <jsonx:array name="platform">
    <jsonx:string>Linux</jsonx:string>
    <jsonx:string>Mac</jsonx:string>
  </jsonx:array>
  <jsonx:string name="version">1.0</jsonx:string>
  <jsonx:string name="homepage">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
  <jsonx:array name="function">
    <jsonx:object>
      <jsonx:array name="input">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="output">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:string name="dataDescription">any format</jsonx:string>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="functionName">
        <jsonx:object>
          <jsonx:string name="term">Formatting</jsonx:string>
          <jsonx:string name="uri">http://edamontology.org/operation_0335</jsonx:string>
        </jsonx:object>
      </jsonx:array>
      <jsonx:string name="functionDescription">Print the first variants of a VCF.</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="description">Print the first variants of a VCF.</jsonx:string>
  <jsonx:object name="docs">
    <jsonx:string name="docsTermsOfUse">https://opensource.org/licenses/MIT</jsonx:string>
    <jsonx:string name="docsGithub">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsHome">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsCitationInstructions">https://github.com/lindenb/jvarkit/wiki/Citing</jsonx:string>
    <jsonx:string name="docsDownloadSource">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
    <jsonx:string name="docsDownload">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
  </jsonx:object>
  <jsonx:array name="collection">
    <jsonx:string>jvarkit</jsonx:string>
  </jsonx:array>
  <jsonx:object name="credits">
    <jsonx:array name="creditsInstitution">
      <jsonx:string>Institut du Thorax, Nantes, France</jsonx:string>
    </jsonx:array>
    <jsonx:array name="creditsDeveloper">
      <jsonx:string>Pierre Lindenbaum</jsonx:string>
    </jsonx:array>
  </jsonx:object>
  <jsonx:array name="interface">
    <jsonx:object>
      <jsonx:string name="interfaceType">Command line</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="name">VcfHead</jsonx:string>
  <jsonx:array name="topic">
    <jsonx:object>
      <jsonx:string name="term">Omics</jsonx:string>
      <jsonx:string name="uri">http://edamontology.org/topic_3391</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="license">MIT License</jsonx:string>
  <jsonx:array name="language">
    <jsonx:string>Java</jsonx:string>
  </jsonx:array>
  <jsonx:array name="resourceType">
    <jsonx:string>Tool</jsonx:string>
  </jsonx:array>
  <jsonx:string name="maturity">Stable</jsonx:string>
  <jsonx:array name="contact">
    <jsonx:object>
      <jsonx:string name="contactURL">https://github.com/lindenb</jsonx:string>
      <jsonx:string name="contactName">Pierre Lindenbaum</jsonx:string>
      <jsonx:array name="contactRole">
        <jsonx:string>Developer</jsonx:string>
        <jsonx:string>Maintainer</jsonx:string>
        <jsonx:string>Helpdesk</jsonx:string>
      </jsonx:array>
    </jsonx:object>
  </jsonx:array>
  <jsonx:object name="publications">
    <jsonx:string name="publicationsPrimaryID">doi:10.6084/m9.figshare.1425030.v1</jsonx:string>
  </jsonx:object>
</jsonx:object>

Using another XSLT stylesheet jsonx2json.xsl, the JSONx is converted to a JSON file.
xsltproc jsonx2json.xsl VcfHead.jsonx > VcfHead.json
the JSON file:
{
    "accessibility": "Public", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
        }
    ], 
    "cost": "Free", 
    "credits": {
        "creditsDeveloper": [
            "Pierre Lindenbaum"
        ], 
        "creditsInstitution": [
            "Institut du Thorax, Nantes, France"
        ]
    }, 
    "description": "Print the first variants of a VCF.", 
    "docs": {
        "docsCitationInstructions": "https://github.com/lindenb/jvarkit/wiki/Citing", 
        "docsDownload": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsDownloadSource": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsGithub": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsHome": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsTermsOfUse": "https://opensource.org/licenses/MIT"
    }, 
    "function": [
        {
            "functionDescription": "Print the first variants of a VCF.", 
            "functionName": [
                {
                    "term": "Formatting", 
                    "uri": "http://edamontology.org/operation_0335"
                }
            ], 
            "input": [
                {
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ], 
            "output": [
                {
                    "dataDescription": "any format", 
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ]
        }
    ], 
    "homepage": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
    "interface": [
        {
            "interfaceType": "Command line"
        }
    ], 
    "language": [
        "Java"
    ], 
    "license": "MIT License", 
    "maturity": "Stable", 
    "name": "VcfHead", 
    "platform": [
        "Linux", 
        "Mac"
    ], 
    "publications": {
        "publicationsPrimaryID": "doi:10.6084/m9.figshare.1425030.v1"
    }, 
    "resourceType": [
        "Tool"
    ], 
    "topic": [
        {
            "term": "Omics", 
            "uri": "http://edamontology.org/topic_3391"
        }
    ], 
    "version": "1.0"
}

Registering the tool

Now that we have the token and the JSON descriptor, we can add VcfHead to Elixir using curl:
curl  -H "Content-type: application/json" \
 -H "Authorization: Token 74dedea023dbad8ecda49ac57bb1074acd794f" \
 -X POST \
 -d  @path/to/VcfHead.json \
 "https://bio.tools/api/tool" |\
 python -m json.tool
output:
{
    "accessibility": "Public", 
    "additionDate": "2016-02-24T11:37:17.458Z", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
(...)

VcfHead is now visible from the Elixir Registry at https://bio.tools/tool/univ-nantes.fr/VcfHead/1.0.
http://i.imgur.com/PjEMjX6.jpg

That's it,
Pierre.

05 December 2015

Happy birthday, my blog. You are now ten years old.

Happy birthday, my blog. You are now 10 years old.




03 December 2015

GATK-UI : a java-swing interface for the Genome Analysis Toolkit.

I've just pushed GATK-UI, a java swing interface for the Genome Analysis Toolkit GATK at https://github.com/lindenb/gatk-ui. This tool is also available as a WebStart/JNLP application.

Screenshot


Why did you create this tool ?

Some non-bioinformatician collaborators often want some coverage data for a defined set of BAM, for a specific region...

Did you test every tool ?

NO

How did you create an interface for each GATK tool ?

Each tool in the GATK is documented in a web page: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php and
each web page is associated to a structured JSON page https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php.json


{
  "summary": "Select a subset of variants from a larger callset",
  "parallel": [
    {
      "arg": "-nt",
      "link": "http://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#-nt",
      "name": "TreeReducible"
    }
  ],
  "activeregion": {},

This JSON is transformed to XML in order to process it with XSLT. An XSLT stylesheet generates some Java code.


That's it,
Pierre

13 July 2015

Playing with #Docker , my notebook

This post is my notebook about docker after we had a very nice introduction about docker by François Moreews (INRIA/IRISA, Rennes). I've used docker today for the first time, my aim was just to create an image containing https://github.com/lindenb/verticalize, a small tool I wrote to verticalize text files.

Install docker

You hate running this kind of command line, don't you?
$ wget -qO- https://get.docker.com/ | sh
sudo password for tatayoyo:
apparmor is enabled in the kernel and apparmor utils were already installed
/sbin/apparmor_parser
+ [ https://get.docker.com/ = https://get.docker.com/ ]
+ sudo -E sh -c apt-key adv --keyserver
(...)

add my linux account to the "docker" group

sudo usermod -aG docker tatayoyo
log out and log back in...

I'm working behind a $@!# proxy: edit /etc/default/docker to set the proxy-things

$ cat /etc/default/docker
(...)
# If you need Docker to use an HTTP proxy, it can also be specified here.
export http_proxy="http://(proxy-host):(proxy-port)/"
export https_proxy="http://(proxy-host):(proxy-port)/"
export ftp_proxy="http://(proxy-host):(proxy-port)/"
export HTTP_PROXY="http://(proxy-host):(proxy-port)/"
export FTP_PROXY="http://(proxy-host):(proxy-port)/"
export HTTPS_PROXY="http://(proxy-host):(proxy-port)/"
(...)

start the docker service

$ sudo start docker
[sudo] password for tatayoyo:
docker start/running, process 5023

create the Dockerfile

Create a new directory. In this directory we create a file named "Dockerfile". It contains
  • the name of the base-image we're using (here the latest ubuntu)
  • the $@!# proxy settings (again ???!!!!)
  • some calls to `apt` to download git, gcc, make ...
  • some statements to clone https://github.com/lindenb/verticalize, compile and install my tool into /usr/local/bin
FROM ubuntu:latest
ENV http_proxy "http://(proxy-host):(proxy-port)/"
ENV https_proxy "http://(proxy-host):(proxy-port)/"
ENV ftp_proxy "http://(proxy-host):(proxy-port)/"
ENV HTTP_PROXY "http://(proxy-host):(proxy-port)/"
ENV HTTPS_PROXY "http://(proxy-host):(proxy-port)/"
ENV FTP_PROXY "http://(proxy-host):(proxy-port)/"
RUN apt-get update
RUN apt-get install -y wget gcc make git
RUN git clone "https://github.com/lindenb/verticalize.git" /tmp/verticalize && make -C /tmp/verticalize && cp /tmp/verticalize/verticalize /usr/local/bin && rm -rf /tmp/verticalize

create the image 'verticalize' from the Dockerfile

sudo docker build -t verticalize .
(...)

List the images

$ docker images
REPOSITORY  TAG                 IMAGE ID         CREATED        VIRTUAL SIZE
verticalize latest              5f7159b4921a     12 seconds ago 317 MB
(...)

Tag the 'verticalize' image as hosted on my dockerhub repo https://registry.hub.docker.com/u/lindenb

$ docker tag 5f7159b4921a lindenb/verticalize:latest

$ docker images
REPOSITORY            TAG                 IMAGE ID            CREATED              VIRTUAL SIZE
verticalize           latest              5f7159b4921a        About a minute ago   317 MB
lindenb/verticalize   latest              5f7159b4921a        About a minute ago   317 MB

Push the image to dockerhub

$ docker push lindenb/verticalize

The push refers to a repository [lindenb/verticalize] (len: 1)
5f7159b4921a: Image push failed

Please login prior to push:
Username: lindenb
Password:
Email: xxxxxxx@yahoo.fr
WARNING: login credentials saved in /home/tatyoyo/.docker/config.json
Login Succeeded
The push refers to a repository [lindenb/verticalize] (len: 1)
5f7159b4921a: Image already exists
68f6ddc7de15: Buffering to Disk

We can now remove the local image ...

$ docker rmi -f  5f7159b4921a

.. and pull the image from dockerhub

$ docker pull lindenb/verticalize
latest: Pulling from lindenb/verticalize
83e4dde6b9cf: Downloading [==================> ] 24.82 MB/65.79 MB
b670fb0c7ecd: Download complete
29460ac93442: Download complete
d2a0ecffe6fa: Download complete
48e98a1c03ae: Download complete
94ac1beb0514: Download complete
e12eda8693a9: Download complete
5eb1952afbb7: Download complete
fb4ac6e6a264: Download complete
0f8372bacf03: Download complete
789c4f122778: Downloading [=================>  ] 7.511 MB/20.92 MB
68f6ddc7de15: Downloading [=====>              ]  4.99 MB/44.61 MB
5f7159b4921a: Download complete

At the end, run a command inside the docker container

My tool verticalize is installed in the image 'lindenb/verticalize:latest' :
$ cat << EOF |  docker run -i lindenb/verticalize:latest
> echo -e "#X\tY\n1\t2\n3\t4" | verticalize
> EOF

>>> 2
$1 #X 1
$2 Y 2
<<< 2

>>> 3
$1 #X 3
$2 Y 4
<<< 3

That's it,
Pierre

29 June 2015

A BLAST to SAM converter.

Some time ago, I received a set of Ion-Torrent /mate-reads of poor quality. I wasn't able to align much using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no). That's why I asked guyduche, my internship student, to write a C program to convert the output of blastn to SAM. His code is available on github at :

The input for blast2bam is
  • the XML output of NCBI blastn (or stdin)
  • the single or pair of fastq file(s)
  • the reference sequence indexed with picard

Example:

fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

Output:
@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2 83  gi|9629357|ref|NC_001802.1| 715 60  180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S =   715 -119    CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN    (),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!    NM:i:13 RG:Z:foo    AS:i:80 XB:f:148.852    XE:Z:4.07e-39
ERR656485.2 163 gi|9629357|ref|NC_001802.1| 715 60  73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S    =   715 119 NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA    !8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))    NM:i:13 RG:Z:foo    AS:i:82 XB:f:152.546    XE:Z:3.15e-40
(...)
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,
Pierre

18 June 2015

Playing with the #GA4GH schemas and #Avro : my notebook

After watching David Haussler's talk "Beacon Project and Data Sharing APIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (ga4gh) coalition. Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the java tools and libraries for apache Avro
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the ga4gh from github
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip

$ find -name "*.avdl"
./schemas-0.5.1/src/main/resources/avro/readmethods.avdl
./schemas-0.5.1/src/main/resources/avro/common.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadata.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadatamethods.avdl
./schemas-0.5.1/src/main/resources/avro/wip/variationReference.avdl
./schemas-0.5.1/src/main/resources/avro/variants.avdl
./schemas-0.5.1/src/main/resources/avro/variantmethods.avdl
./schemas-0.5.1/src/main/resources/avro/beacon.avdl
./schemas-0.5.1/src/main/resources/avro/references.avdl
./schemas-0.5.1/src/main/resources/avro/referencemethods.avdl
./schemas-0.5.1/src/main/resources/avro/reads.avdl
Those schemas can be compiled to Java using the avro-tools:
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
  schemas-0.5.1/src/main/resources/avro/variants.avpr
  
$ find generated/org/ -name "*.java"
generated/org/ga4gh/GAPosition.java
generated/org/ga4gh/GAVariantSetMetadata.java
generated/org/ga4gh/GACall.java
generated/org/ga4gh/GAException.java
generated/org/ga4gh/GACigarOperation.java
generated/org/ga4gh/GAVariantSet.java
generated/org/ga4gh/GAVariants.java
generated/org/ga4gh/GAVariant.java
generated/org/ga4gh/GACallSet.java
generated/org/ga4gh/GACigarUnit.java
As a test, the following java source uses the classes generated by avro to create nine variants and serialize them to Avro

Compile, archive and execute:
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use the avro-tools to convert the generated file variant.avro to json
java -jar avro-tools-1.7.7.jar tojson variant.avro

Output:

The complete Makefile



That's it,
Pierre

07 May 2015

Monitoring a java application with mbeans. An example with samtools/htsjdk.

"A MBean is a Java object that follows the JMX specification. A MBean can represent a device, an application, or any resource that needs to be managed. The JConsole graphical user interface is a monitoring tool that complies to the JMX specification.". In this post I'll show how I've modified the sources of the htsjdk library to monitor the java program reading a VCF file from the Exac server. See my commit at https://github.com/lindenb/htsjdk/commit/3c1ac1a18917aaa69f8dc49c70fd893a6a0542c3.

First, we define a Java interface ProgressLoggerMBean to tell Java which information will be forwarded to the jconsole: the number of records processed, the elapsed time, etc.

package htsjdk.samtools.util;
public interface ProgressLoggerMBean
    {
    /* the noun to use when logging, e.g. "Records, Variants, Loci" */
    public String getNoun();
    /* the verb to log, e.g. "Processed, Read, Written" */
    public String getVerb();
    /** Returns the count of records processed. */
    public long getCount();
    /** elapsed time */
    public String getElapsedTime();
    /** last record */
    public String getLastRecord();
    }
The already existing htsjdk class htsjdk.samtools.util.ProgressLogger is modified: it now implements ProgressLoggerMBean:
public class ProgressLogger
   implements ProgressLoggerInterface, Closeable, ProgressLoggerMBean
The methods are implemented:
(...)
     @Override
    public String getElapsedTime(){
        return this.formatElapseTime(this.getElapsedSeconds());
    }
    
    @Override
    public String getLastRecord(){
        return this.lastRecord;
    }
In the constructor we try to connect to the MBean server that has been created and initialized by the platform. The ProgressLogger is registered in the MBean server under an ObjectName:
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
/* defines an object name for the MBean instance that it will create */
this.objectMBean = new ObjectName("htsjdk.samtools.util:type="+noun);
mbs.registerMBean(this, this.objectMBean);

A 'close' method is used to unregister the object from the MBean server:
@Override
public void close() {
    if(this.objectMBean!=null) {
        try {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            mbs.unregisterMBean(this.objectMBean);
        } catch(Exception err) {
            //ignore
        } finally {
            this.objectMBean=null;
        }
    }
}
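The register / read / unregister cycle can be tried without htsjdk. The following is only a minimal sketch using the JDK alone: the names MBeanSketch, Counter, CounterMBean and the object name "sketch.demo:type=Counter" are made up for the illustration and are not part of my htsjdk commit.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class MBeanSketch {
    /** Management interface: JMX exposes each getter as a read-only attribute. */
    public interface CounterMBean {
        long getCount();
    }

    /** Standard MBean: the class name followed by "MBean" must be the interface name. */
    public static class Counter implements CounterMBean {
        private long count = 0L;
        public void record() { count++; }
        @Override public long getCount() { return count; }
    }

    /** Registers a Counter, records n events and reads the attribute "Count" back through the MBean server. */
    static long readCountViaJmx(int n) {
        try {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // hypothetical object name, chosen for this sketch only
            ObjectName name = new ObjectName("sketch.demo:type=Counter");
            Counter counter = new Counter();
            mbs.registerMBean(counter, name);
            try {
                for (int i = 0; i < n; i++) counter.record();
                return (Long) mbs.getAttribute(name, "Count");
            } finally {
                // same idea as the close() method: always unregister
                mbs.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readCountViaJmx(3));
    }
}
```

While such an object is registered, jconsole lists it under its domain ("sketch.demo" here) and shows the attribute Count.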
Here is an example. This program uses the htsjdk library to parse a VCF file:
import htsjdk.variant.vcf.*;
import htsjdk.variant.variantcontext.*;
import htsjdk.tribble.readers.*;
import htsjdk.samtools.util.*;

public class TestProgress
 {
 private final static Log log = Log.getInstance(TestProgress.class);
 public static void main(String args[]) throws Exception
  {
  ProgressLogger progress = new ProgressLogger(log, 1000, "Read VCF");
  VCFCodec codec= new VCFCodec();
  LineReader r= LineReaderUtil.fromBufferedStream(System.in);
  LineIteratorImpl t= new LineIteratorImpl(r);
  codec.readActualHeader(t);
  while(t.hasNext())
   {
   VariantContext ctx = codec.decode(t.next());
   progress.record(ctx.getContig(),ctx.getStart());
   }
  r.close();
  progress.close();
  }
 }
Compile and execute while downloading ExAC:
javac -cp dist/htsjdk-1.130.jar:dist/snappy-java-1.0.3-rc3.jar:dist/commons-jexl-2.1.1.jar:dist/commons-logging-1.1.1.jar TestProgress.java && \
curl -s  "ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/ExAC.r0.3.sites.vep.vcf.gz" |\
gunzip -c  |\
java -cp dist/htsjdk-1.130.jar:dist/snappy-java-1.0.3-rc3.jar:dist/commons-jexl-2.1.1.jar:dist/commons-logging-1.1.1.jar:. TestProgress

(...)
INFO 2015-05-07 21:07:02 TestProgress Read VCF       675,000 records.  Elapsed time: 00:03:33s.  Time for last 1,000:    0s.  Last read position: 1:168,035,033
INFO 2015-05-07 21:07:03 TestProgress Read VCF       676,000 records.  Elapsed time: 00:03:33s.  Time for last 1,000:    0s.  Last read position: 1:168,216,140
INFO 2015-05-07 21:07:03 TestProgress Read VCF       677,000 records.  Elapsed time: 00:03:34s.  Time for last 1,000:    0s.  Last read position: 1:169,076,058
INFO 2015-05-07 21:07:03 TestProgress Read VCF       678,000 records.  Elapsed time: 00:03:34s.  Time for last 1,000:    0s.  Last read position: 1:169,366,434
INFO 2015-05-07 21:07:03 TestProgress Read VCF       679,000 records.  Elapsed time: 00:03:34s.  Time for last 1,000:    0s.  Last read position: 1:169,500,081
(...)

The progression can now be monitored in the jconsole:


That's it.
Pierre

05 May 2015

Playing with hadoop/mapreduce and htsjdk/VCF : my notebook.

The aim of this test is to get a count of each type of variant/genotypes in a VCF file using Apache Hadoop and the java library for NGS htsjdk. My source code is available at: https://github.com/lindenb/hadoop-sandbox/blob/master/src/main/java/com/github/lindenb/hadoop/Test.java.

First, and this is my main problem, I needed to create a class 'VcfRow' that would contain the whole data about a variant. As I need to keep the information about all the semantics in the VCF header, each record contains the whole VCF header (!). I asked SO if there was an elegant way to save the header in the hadoop workflow but it currently seems that there is no such solution (http://stackoverflow.com/questions/30052859/hadoop-mapreduce-handling-a-text-file-with-a-header). This class VcfRow must implement WritableComparable to be serialized by the hadoop pipeline. It's awfully sloooooow since we need to create a htsjdk.variant.vcf.VCFCodec and parse a htsjdk.variant.vcf.VCFHeader for each new variant.

public static class VcfRow
implements WritableComparable<VcfRow>
 {
 private List<String> headerLines;
 private String line;
 private VariantContext ctx=null;
 private VCFHeader header =null;
 private VCFCodec codec=new VCFCodec();
 public VcfRow()
   {
 this.headerLines = Collections.emptyList();
 this.line="";
   }
 public VcfRow(List<String> headerLines,String line)
  {
 this.headerLines=headerLines; 
 this.line=line;
  }
 
@Override
public void write(DataOutput out) throws IOException
 {
 out.writeInt(this.headerLines.size());
 for(int i=0;i< this.headerLines.size();++i)
  {
  out.writeUTF(this.headerLines.get(i));
  }
 byte array[]=line.getBytes();
 out.writeInt(array.length);
 out.write(array);
 }

@Override
public void readFields(DataInput in) throws IOException
 {
 int n= in.readInt();
 this.headerLines=new ArrayList<String>(n);
 for(int i=0;i<n;++i) this.headerLines.add(in.readUTF());
 n = in.readInt();
 byte array[]=new byte[n];
 in.readFully(array);
 this.line=new String(array);
 this.codec=new VCFCodec();
 this.ctx=null;
 this.header=null;
 }

public VCFHeader getHeader()
 {
 if(this.header==null)
  {
  this.header = (VCFHeader)this.codec.readActualHeader(new MyLineIterator());
  }
 return this.header;
 }

public VariantContext getVariantContext()
 {
 if(this.ctx==null)
  {
  if(this.header==null) getHeader();//force decode header
  this.ctx=this.codec.decode(this.line);
  }
 return this.ctx;
 }

@Override
public int compareTo(VcfRow o)
 {
 int i = this.getVariantContext().getContig().compareTo(o.getVariantContext().getContig());
 if(i!=0) return i;
 i = Integer.compare(this.getVariantContext().getStart(), o.getVariantContext().getStart());
 if(i!=0) return i;
 i =  this.getVariantContext().getReference().compareTo( o.getVariantContext().getReference());
 if(i!=0) return i;
 return this.line.compareTo(o.line);
 }

   private  class MyLineIterator
 extends AbstractIterator<String>
 implements LineIterator
 { 
 int index=0;
 @Override
 protected String advance()
  {
  if(index>= headerLines.size()) return null;
  return headerLines.get(index++);
  }
 }
}
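The write/readFields pair can be checked outside Hadoop with plain java.io streams. This is only a sketch under assumptions: RowSketch is a hypothetical stand-in for VcfRow that does not implement the Hadoop WritableComparable interface, and it fixes the charset to UTF-8 whereas the class above relies on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RowSketch {
    List<String> headerLines;
    String line;

    RowSketch(List<String> headerLines, String line) {
        this.headerLines = headerLines;
        this.line = line;
    }

    /** Same layout as VcfRow.write: header count, header lines, then the length-prefixed record. */
    void write(DataOutput out) throws IOException {
        out.writeInt(headerLines.size());
        for (String h : headerLines) out.writeUTF(h);
        byte[] array = line.getBytes(StandardCharsets.UTF_8);
        out.writeInt(array.length);
        out.write(array);
    }

    /** Reads the fields back in exactly the order they were written. */
    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        headerLines = new ArrayList<>(n);
        for (int i = 0; i < n; i++) headerLines.add(in.readUTF());
        byte[] array = new byte[in.readInt()];
        in.readFully(array);
        line = new String(array, StandardCharsets.UTF_8);
    }

    /** Serializes src to a byte array and decodes it into a fresh instance. */
    static RowSketch roundTrip(RowSketch src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            RowSketch copy = new RowSketch(new ArrayList<>(), "");
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> header = new ArrayList<>();
        header.add("##fileformat=VCFv4.1");
        header.add("#CHROM\tPOS\tID\tREF\tALT");
        RowSketch copy = roundTrip(new RowSketch(header, "1\t100\t.\tA\tT"));
        System.out.println(copy.headerLines.size() + " header lines, record: " + copy.line);
    }
}
```

The round trip also makes the cost visible: every serialized record carries a full copy of the header lines.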

Then a special InputFormat is created for the VCF format. As we need to keep a trace of the header, this class overrides `isSplitable` to return false. The class VcfInputFormat creates an instance of RecordReader that reads the whole VCF header the first time it is invoked, in the method `initialize`. This 'VcfRecordReader' creates a new VcfRow for each line.

public static class VcfInputFormat extends FileInputFormat<LongWritable, VcfRow>
   {
 private List<String> headerLines=new ArrayList<String>();
 
 @Override
 public RecordReader<LongWritable, VcfRow> createRecordReader(InputSplit split,
   TaskAttemptContext context) throws IOException,
   InterruptedException {
  return new VcfRecordReader();
  }  
 @Override
 protected boolean isSplitable(JobContext context, Path filename) {
  return false;
  }
  
 //LineRecordReader
  private class VcfRecordReader extends RecordReader<LongWritable, VcfRow>
    {
  private LineRecordReader delegate=new LineRecordReader();
  public VcfRecordReader() throws IOException
    {
    }
  
   @Override
  public void initialize(InputSplit genericSplit,
    TaskAttemptContext context) throws IOException {
    delegate.initialize(genericSplit, context);
   while( delegate.nextKeyValue())
    {
    String row = delegate.getCurrentValue().toString();
    if(!row.startsWith("#")) throw new IOException("Bad VCF header");
    headerLines.add(row);
    if(row.startsWith("#CHROM")) break;
    }
    }
   @Override
  public LongWritable getCurrentKey() throws IOException,
    InterruptedException {
   return delegate.getCurrentKey();
    }
   
   @Override
  public VcfRow getCurrentValue() throws IOException,
    InterruptedException {
   Text row = this.delegate.getCurrentValue();
   return new VcfRow(headerLines,row.toString());
    }
   
   @Override
  public float getProgress() throws IOException, InterruptedException {
   return this.delegate.getProgress();
    }
   
   @Override
  public boolean nextKeyValue() throws IOException,
    InterruptedException {
   return this.delegate.nextKeyValue();
    }
   
   @Override
  public void close() throws IOException {
    delegate.close();
   }
     }
   }
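The header-consuming loop of `initialize` can be illustrated without Hadoop. In this sketch a BufferedReader stands in for the LineRecordReader delegate, and the names HeaderSketch, readHeader and firstRecordAfterHeader are hypothetical helpers that do not exist in my sources.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class HeaderSketch {
    /** Consumes lines up to and including the #CHROM line; the reader is left on the first record. */
    static List<String> readHeader(BufferedReader in) throws IOException {
        List<String> headerLines = new ArrayList<>();
        String row;
        while ((row = in.readLine()) != null) {
            if (!row.startsWith("#")) throw new IOException("Bad VCF header");
            headerLines.add(row);
            if (row.startsWith("#CHROM")) break;
        }
        return headerLines;
    }

    /** Skips the header of an in-memory VCF and returns the first record, or null if there is none. */
    static String firstRecordAfterHeader(String vcfText) {
        try (BufferedReader in = new BufferedReader(new StringReader(vcfText))) {
            readHeader(in);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String vcf = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tT\n";
        System.out.println(firstRecordAfterHeader(vcf));
    }
}
```

Because the header is only found at the start of the file, a reader opened in the middle of the file would never see it, which is why the input is declared non-splittable.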

The hadoop mapper uses the information of each VcfRow and produces a count of each category:
public static class VariantMapper
   extends Mapper<LongWritable, VcfRow, Text, IntWritable>{

 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 public void map(LongWritable key, VcfRow vcfRow, Context context ) throws IOException, InterruptedException {
  VariantContext ctx = vcfRow.getVariantContext();
  if( ctx.isIndel())
   { 
   word.set("ctx_indel");
      context.write(word, one);
   }
  if( ctx.isBiallelic())
   { 
   word.set("ctx_biallelic");
      context.write(word, one);
   }
  if( ctx.isSNP())
   { 
   word.set("ctx_snp");
   context.write(word, one);
   } 
  if( ctx.hasID())
   { 
   word.set("ctx_id");
   context.write(word, one);
   } 
  word.set("ctx_total");
  context.write(word, one);
 
  for(String sample: vcfRow.getHeader().getSampleNamesInOrder())
   {
   Genotype g =vcfRow.getVariantContext().getGenotype(sample);
   word.set(sample+" "+ctx.getType()+" "+g.getType().name());
   context.write(word, one);
   }

  }
 }

The Reducer computes the sum of each category:
public static class IntSumReducer
   extends Reducer<Text,IntWritable,Text,IntWritable> {
 private IntWritable result = new IntWritable();

 public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {
   int sum = 0;
   for (IntWritable val : values) {
  sum += val.get();
    }
   result.set(sum);
   context.write(key, result);
 }
}
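The mapper and the reducer together amount to emitting one (category, 1) pair per matching category and summing the pairs by key. Here is a minimal sketch of that logic with plain collections, without Hadoop and without htsjdk: the class CountSketch and the boolean flags replacing the VariantContext tests are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountSketch {
    /** Map step: one key per category the (hypothetical) variant flags fall into, plus the total. */
    static List<String> map(boolean isIndel, boolean isSnp, boolean hasId) {
        List<String> keys = new ArrayList<>();
        if (isIndel) keys.add("ctx_indel");
        if (isSnp) keys.add("ctx_snp");
        if (hasId) keys.add("ctx_id");
        keys.add("ctx_total");
        return keys;
    }

    /** Reduce step: each emitted key counts for one; the sums are grouped by key. */
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String k : keys) sums.merge(k, 1, Integer::sum);
        return sums;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        emitted.addAll(map(false, true, true));   // a SNP with an ID
        emitted.addAll(map(true, false, false));  // an indel without ID
        System.out.println(reduce(emitted));
    }
}
```

Since addition is associative and commutative, partial sums can be computed before the final reduce, which is what allows the same class to be used as a combiner and as a reducer.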

and here is the main program:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "snp count");
    job.setJarByClass(Test.class);
    job.setMapperClass(VariantMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    Path inputPath=new Path(args[0]);
    job.setInputFormatClass(VcfInputFormat.class);
    FileInputFormat.addInputPath(job, inputPath);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

Download, compile, Run:
lindenb@hardyweinberg:~/src/hadoop-sandbox$ make -Bn
rm -rf hadoop-2.7.0
curl -L -o hadoop-2.7.0.tar.gz "http://apache.spinellicreations.com/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz"
tar xvfz hadoop-2.7.0.tar.gz
rm hadoop-2.7.0.tar.gz
touch -c hadoop-2.7.0/bin/hadoop
rm -rf htsjdk-1.130
curl -L -o 1.130.tar.gz "https://github.com/samtools/htsjdk/archive/1.130.tar.gz"
tar xvfz 1.130.tar.gz
rm 1.130.tar.gz
(cd htsjdk-1.130 && ant )
mkdir -p tmp dist
javac -d tmp -cp hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar:hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.0.jar:hadoop-2.7.0/share/hadoop/common/lib/hadoop-annotations-2.7.0.jar:hadoop-2.7.0/share/hadoop/common/lib/log4j-1.2.17.jar:htsjdk-1.130/dist/commons-logging-1.1.1.jar:htsjdk-1.130/dist/htsjdk-1.130.jar:htsjdk-1.130/dist/commons-jexl-2.1.1.jar:htsjdk-1.130/dist/snappy-java-1.0.3-rc3.jar -sourcepath src/main/java src/main/java/com/github/lindenb/hadoop/Test.java 
jar cvf dist/test01.jar -C tmp .
rm -rf tmp
mkdir -p input
curl -o input/CEU.exon.2010_09.genotypes.vcf.gz "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/exon/snps/CEU.exon.2010_09.genotypes.vcf.gz"
gunzip -f input/CEU.exon.2010_09.genotypes.vcf.gz
rm -rf output
HADOOP_CLASSPATH=htsjdk-1.130/dist/commons-logging-1.1.1.jar:htsjdk-1.130/dist/htsjdk-1.130.jar:htsjdk-1.130/dist/commons-jexl-2.1.1.jar:htsjdk-1.130/dist/snappy-java-1.0.3-rc3.jar hadoop-2.7.0/bin/hadoop jar dist/test01.jar com.github.lindenb.hadoop.Test \
   input/CEU.exon.2010_09.genotypes.vcf output
cat output/*

Here is the output of the last command:

15/05/05 17:18:34 INFO input.FileInputFormat: Total input paths to process : 1
15/05/05 17:18:34 INFO mapreduce.JobSubmitter: number of splits:1
15/05/05 17:18:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1186897577_0001
15/05/05 17:18:34 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/05/05 17:18:34 INFO mapreduce.Job: Running job: job_local1186897577_0001
15/05/05 17:18:34 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/05/05 17:18:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/05/05 17:18:34 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/05/05 17:18:34 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/05 17:18:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1186897577_0001_m_000000_0
15/05/05 17:18:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/05/05 17:18:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/05/05 17:18:34 INFO mapred.MapTask: Processing split: file:/home/lindenb/src/hadoop-sandbox/input/CEU.exon.2010_09.genotypes.vcf:0+2530564
15/05/05 17:18:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/05/05 17:18:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/05/05 17:18:34 INFO mapred.MapTask: soft limit at 83886080
15/05/05 17:18:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/05/05 17:18:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/05/05 17:18:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/05/05 17:18:35 INFO mapreduce.Job: Job job_local1186897577_0001 running in uber mode : false
15/05/05 17:18:35 INFO mapreduce.Job:  map 0% reduce 0%
15/05/05 17:18:36 INFO mapred.LocalJobRunner: 
15/05/05 17:18:36 INFO mapred.MapTask: Starting flush of map output
15/05/05 17:18:36 INFO mapred.MapTask: Spilling map output
15/05/05 17:18:36 INFO mapred.MapTask: bufstart = 0; bufend = 7563699; bufvoid = 104857600
15/05/05 17:18:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24902536(99610144); length = 1311861/6553600
15/05/05 17:18:38 INFO mapred.MapTask: Finished spill 0
15/05/05 17:18:38 INFO mapred.Task: Task:attempt_local1186897577_0001_m_000000_0 is done. And is in the process of committing
(...)
NA12843 SNP HOM_REF 2515
NA12843 SNP HOM_VAR 242
NA12843 SNP NO_CALL 293
NA12872 SNP HET 394
NA12872 SNP HOM_REF 2282
NA12872 SNP HOM_VAR 188
NA12872 SNP NO_CALL 625
NA12873 SNP HET 336
NA12873 SNP HOM_REF 2253
NA12873 SNP HOM_VAR 184
NA12873 SNP NO_CALL 716
NA12874 SNP HET 357
NA12874 SNP HOM_REF 2395
NA12874 SNP HOM_VAR 229
NA12874 SNP NO_CALL 508
NA12878 SNP HET 557
NA12878 SNP HOM_REF 2631
NA12878 SNP HOM_VAR 285
NA12878 SNP NO_CALL 16
NA12889 SNP HET 287
NA12889 SNP HOM_REF 2110
NA12889 SNP HOM_VAR 112
NA12889 SNP NO_CALL 980
NA12890 SNP HET 596
NA12890 SNP HOM_REF 2587
NA12890 SNP HOM_VAR 251
NA12890 SNP NO_CALL 55
NA12891 SNP HET 609
NA12891 SNP HOM_REF 2591
NA12891 SNP HOM_VAR 251
NA12891 SNP NO_CALL 38
NA12892 SNP HET 585
NA12892 SNP HOM_REF 2609
NA12892 SNP HOM_VAR 236
NA12892 SNP NO_CALL 59
ctx_biallelic 3489
ctx_id 3489
ctx_snp 3489
ctx_total 3489


that's it,
Pierre