Showing posts with label api. Show all posts
Showing posts with label api. Show all posts

24 February 2016

Registering a tool in the @ELIXIREurope regisry using XML, XSLT, JSON and curl. My notebook.

The Elixir Registry / pmid:26538599 "A portal to bioinformatics resources world-wide. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools."
In this post, I will describe how I've used the bio.tools API to register some tools from jvarkit.

Authenticate with your credentials

using curl, the 'bio.tools' service returns a authentication token.

$ curl -s \
 -H "Content-type: application/json" \
 -X POST \
 -d '{"username":"my-login@univ-nantes.fr","password":"password1234"}' \
 https://bio.tools/api/auth/login |\
 python -m json.tool
{
    "token": "74dedea023dbad8ecda49ac57bb1074acd794f"
}

Creating a JSON describing the tool.

The tool I'm goind to use is VCFhead. A very simple tool printing the first variants of a VCF file. In jvarkit I don't write the code parsing the arguments, everything is described using a XML file that is going to be processed with a XSTL stylesheet to generate an abstract java code handling the options, etc....

xsltproc command2java VcfHead.xml > AbstractVcfHead.java

For VcfHead the XML descriptor is available here: https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/VcfHead.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app xmlns="http://github.com/lindenb/jvarkit/" xmlns:h="http://www.w3.org/1999/xhtml" app="VcfHead" package="com.github.lindenb.jvarkit.tools.misc" >
  <description>Print the first variants of a VCF.</description>
  <input type="vcf"/>
  <output type="vcf"/>
  <options>
    <option name="count" type="int" opt="n" longopt="count" min-inclusive="0" default="10">
      <description>number of variants to be printed</description>
    </option>
  </options>
  <documentation>
    <h:h3>Example</h:h3>
   (...)
  </documentation>
</app>

Using a first XSLT stylesheet https://github.com/lindenb/jvarkit/blob/master/src/main/resources/xsl/jsonxelixir.xsl, 'VcfHead.xml' is firstly converted to the 'infamous' JSONx (JSON+XML) format .
xsltproc jsonxelixir VcfHead.xml > VcfHead.jsonx
The JSONx file:
<?xml version="1.0"?>
<jsonx:object xmlns:jsonx="http://www.ibm.com/xmlns/prod/2009/jsonx" xmlns:c="http://github.com/lindenb/jvarkit/" xmlns="http://www.w3.org/1999/xhtml" xmlns:x="http://www.ibm.com/xmlns/prod/2009/jsonx">
  <jsonx:string name="accessibility">Public</jsonx:string>
  <jsonx:string name="affiliation">univ-nantes.fr</jsonx:string>
  <jsonx:string name="cost">Free</jsonx:string>
  <jsonx:array name="platform">
    <jsonx:string>Linux</jsonx:string>
    <jsonx:string>Mac</jsonx:string>
  </jsonx:array>
  <jsonx:string name="version">1.0</jsonx:string>
  <jsonx:string name="homepage">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
  <jsonx:array name="function">
    <jsonx:object>
      <jsonx:array name="input">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="output">
        <jsonx:object>
          <jsonx:object name="dataType">
            <jsonx:string name="term">File name</jsonx:string>
            <jsonx:string name="uri">http://edamontology.org/data_1050</jsonx:string>
          </jsonx:object>
          <jsonx:string name="dataDescription">any format</jsonx:string>
          <jsonx:array name="dataFormat">
            <jsonx:object>
              <jsonx:string name="term">VCF</jsonx:string>
              <jsonx:string name="uri">http://edamontology.org/format_3016</jsonx:string>
            </jsonx:object>
          </jsonx:array>
        </jsonx:object>
      </jsonx:array>
      <jsonx:array name="functionName">
        <jsonx:object>
          <jsonx:string name="term">Formatting</jsonx:string>
          <jsonx:string name="uri">http://edamontology.org/operation_0335</jsonx:string>
        </jsonx:object>
      </jsonx:array>
      <jsonx:string name="functionDescription">Print the first variants of a VCF.</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="description">Print the first variants of a VCF.</jsonx:string>
  <jsonx:object name="docs">
    <jsonx:string name="docsTermsOfUse">https://opensource.org/licenses/MIT</jsonx:string>
    <jsonx:string name="docsGithub">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsHome">https://github.com/lindenb/jvarkit/wiki/VcfHead</jsonx:string>
    <jsonx:string name="docsCitationInstructions">https://github.com/lindenb/jvarkit/wiki/Citing</jsonx:string>
    <jsonx:string name="docsDownloadSource">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
    <jsonx:string name="docsDownload">https://github.com/lindenb/jvarkit/archive/master.zip</jsonx:string>
  </jsonx:object>
  <jsonx:array name="collection">
    <jsonx:string>jvarkit</jsonx:string>
  </jsonx:array>
  <jsonx:object name="credits">
    <jsonx:array name="creditsInstitution">
      <jsonx:string>Institut du Thorax, Nantes, France</jsonx:string>
    </jsonx:array>
    <jsonx:array name="creditsDeveloper">
      <jsonx:string>Pierre Lindenbaum</jsonx:string>
    </jsonx:array>
  </jsonx:object>
  <jsonx:array name="interface">
    <jsonx:object>
      <jsonx:string name="interfaceType">Command line</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="name">VcfHead</jsonx:string>
  <jsonx:array name="topic">
    <jsonx:object>
      <jsonx:string name="term">Omics</jsonx:string>
      <jsonx:string name="uri">http://edamontology.org/topic_3391</jsonx:string>
    </jsonx:object>
  </jsonx:array>
  <jsonx:string name="license">MIT License</jsonx:string>
  <jsonx:array name="language">
    <jsonx:string>Java</jsonx:string>
  </jsonx:array>
  <jsonx:array name="resourceType">
    <jsonx:string>Tool</jsonx:string>
  </jsonx:array>
  <jsonx:string name="maturity">Stable</jsonx:string>
  <jsonx:array name="contact">
    <jsonx:object>
      <jsonx:string name="contactURL">https://github.com/lindenb</jsonx:string>
      <jsonx:string name="contactName">Pierre Lindenbaum</jsonx:string>
      <jsonx:array name="contactRole">
        <jsonx:string>Developer</jsonx:string>
        <jsonx:string>Maintainer</jsonx:string>
        <jsonx:string>Helpdesk</jsonx:string>
      </jsonx:array>
    </jsonx:object>
  </jsonx:array>
  <jsonx:object name="publications">
    <jsonx:string name="publicationsPrimaryID">doi:10.6084/m9.figshare.1425030.v1</jsonx:string>
  </jsonx:object>
</jsonx:object>

Using another XSLT stylesheet jsonx2json.xsl, the JSONx is converted to a JSON file.
xsltproc jsonx2json.xsl VcfHead.jsonx > VcfHead.json
the JSON file:
{
    "accessibility": "Public", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
        }
    ], 
    "cost": "Free", 
    "credits": {
        "creditsDeveloper": [
            "Pierre Lindenbaum"
        ], 
        "creditsInstitution": [
            "Institut du Thorax, Nantes, France"
        ]
    }, 
    "description": "Print the first variants of a VCF.", 
    "docs": {
        "docsCitationInstructions": "https://github.com/lindenb/jvarkit/wiki/Citing", 
        "docsDownload": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsDownloadSource": "https://github.com/lindenb/jvarkit/archive/master.zip", 
        "docsGithub": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsHome": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
        "docsTermsOfUse": "https://opensource.org/licenses/MIT"
    }, 
    "function": [
        {
            "functionDescription": "Print the first variants of a VCF.", 
            "functionName": [
                {
                    "term": "Formatting", 
                    "uri": "http://edamontology.org/operation_0335"
                }
            ], 
            "input": [
                {
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ], 
            "output": [
                {
                    "dataDescription": "any format", 
                    "dataFormat": [
                        {
                            "term": "VCF", 
                            "uri": "http://edamontology.org/format_3016"
                        }
                    ], 
                    "dataType": {
                        "term": "File name", 
                        "uri": "http://edamontology.org/data_1050"
                    }
                }
            ]
        }
    ], 
    "homepage": "https://github.com/lindenb/jvarkit/wiki/VcfHead", 
    "interface": [
        {
            "interfaceType": "Command line"
        }
    ], 
    "language": [
        "Java"
    ], 
    "license": "MIT License", 
    "maturity": "Stable", 
    "name": "VcfHead", 
    "platform": [
        "Linux", 
        "Mac"
    ], 
    "publications": {
        "publicationsPrimaryID": "doi:10.6084/m9.figshare.1425030.v1"
    }, 
    "resourceType": [
        "Tool"
    ], 
    "topic": [
        {
            "term": "Omics", 
            "uri": "http://edamontology.org/topic_3391"
        }
    ], 
    "version": "1.0"
}

Registering the tool

Now we have the Token and the json descriptor we can add VcfHead to Elixir using curl:
curl  -H "Content-type: application/json" \
 -H "Authorization: Token 74dedea023dbad8ecda49ac57bb1074acd794f" \
 -X POST \
 -d  @path/to/VcfHead.json \
 "https://bio.tools/api/tool" |\
 python -m json.tool
output:
{
    "accessibility": "Public", 
    "additionDate": "2016-02-24T11:37:17.458Z", 
    "affiliation": "univ-nantes.fr", 
    "collection": [
        "jvarkit"
    ], 
    "contact": [
        {
            "contactName": "Pierre Lindenbaum", 
            "contactRole": [
                "Developer", 
                "Maintainer", 
                "Helpdesk"
            ], 
            "contactURL": "https://github.com/lindenb"
(...)

VCfhead is now visible from the Elixir Registry at https://bio.tools/tool/univ-nantes.fr/VcfHead/1.0.
http://i.imgur.com/PjEMjX6.jpg

That's it,
Pierre.

18 June 2015

Playing with the #GA4GH schemas and #Avro : my notebook

After watching David Haussler's talk "Beacon Project and Data Sharing ApIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (ga4gh) coalition Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the java tools and libraries for apache Avro
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the ga4gh from github
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip

$ find -name "*.avdl"
./schemas-0.5.1/src/main/resources/avro/readmethods.avdl
./schemas-0.5.1/src/main/resources/avro/common.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadata.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadatamethods.avdl
./schemas-0.5.1/src/main/resources/avro/wip/variationReference.avdl
./schemas-0.5.1/src/main/resources/avro/variants.avdl
./schemas-0.5.1/src/main/resources/avro/variantmethods.avdl
./schemas-0.5.1/src/main/resources/avro/beacon.avdl
./schemas-0.5.1/src/main/resources/avro/references.avdl
./schemas-0.5.1/src/main/resources/avro/referencemethods.avdl
./schemas-0.5.1/src/main/resources/avro/reads.avdl
Those schema can be compiled to java using the avro-tools
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
  schemas-0.5.1/src/main/resources/avro/variants.avpr
  
$ find generated/org/ -name "*.java"
generated/org/ga4gh/GAPosition.java
generated/org/ga4gh/GAVariantSetMetadata.java
generated/org/ga4gh/GACall.java
generated/org/ga4gh/GAException.java
generated/org/ga4gh/GACigarOperation.java
generated/org/ga4gh/GAVariantSet.java
generated/org/ga4gh/GAVariants.java
generated/org/ga4gh/GAVariant.java
generated/org/ga4gh/GACallSet.java
generated/org/ga4gh/GACigarUnit.java
As a test, the following java source uses the classes generated by avro to create nine variants and serialize them to Avro

Compile, archive and execute:
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use the avro-tools to convert the generated file variant.avro to json
java -jar avro-tools-1.7.7.jar tojson variant.avro

Output:

The complete Makefile



That's it,
Pierre

30 January 2015

Filtering Fasta Sequences using the #NCBI C++ API. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I've built a simple "Hello World" application using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm gonna to extend the code in order to create a program filtering FASTA sequences on their sizes.

This new application FastaFilterSize needs three new arguments:

  • '-m' min size of the fasta sequence
  • '-M' max size of the fasta sequence
  • '-v' inverse selection

the method FastaFilterSize::Init() is used to register those optional arguments

/* optional argument for min size */
arg_desc->AddOptionalKey(
    "m",
    "min_size",
    "min size inclusive",
    CArgDescriptions::eInteger
    );
/* optional argument for max size */
arg_desc->AddOptionalKey(
    "M",
    "max_size",
    "max size inclusive",
    CArgDescriptions::eInteger
    );
/* add simple flag to inverse selection */
arg_desc->AddFlag("v", 
    "inverse selection",
    true
    );

the FastaFilterSize::Run() method is the workhorse of the program. Here, we get the output stream that will be used by a CFastaOstream to print the filtered Fasta sequences:

(...)
/* get output stream */
CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
/* create a FASTA writer */
CFastaOstream* writer = new CFastaOstream(out);
(...)

We get the input stream that will be used by a CFastaReader to read the FASTA sequences to be filtered.

/* get input stream */
CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
CFastaReader reader(input_stream ,
        CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
        );

We get the optional values for min-length and max-length:

if (this->GetArgs()["m"])
    {
    min_inclusive = this->GetArgs()["m"].AsInteger();
    }
if (this->GetArgs()["M"])
    {
    max_inclusive = this->GetArgs()["M"].AsInteger();
    }

We get the 'inverse' flag.

bool invert=this->GetArgs()["v"];

While a Fasta sequence can be read, we check its size and we print it if needed:

while( !reader.AtEOF() )
    {
    /* read next sequence */
    CRef<CSeq_entry> seq(reader.ReadOneSeq());
    /* should we print the sequence ?*/
    if((seq->GetSeq().GetLength()>= min_inclusive && 
        seq->GetSeq().GetLength()<= max_inclusive
        )!= invert)
        {
        /* yes, print */
        writer->Write(*seq);
        }
    }

All in one:

#include <memory>
#include <limits>
#include <objtools/readers/fasta.hpp>
#include "corelib/ncbiapp.hpp"
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Delta_ext.hpp>
#include <objects/seq/Delta_seq.hpp>
#include <objects/seq/NCBIeaa.hpp>
#include <objects/seq/IUPACaa.hpp>
#include <objects/seq/IUPACna.hpp>
#include <objects/seq/Seg_ext.hpp>
#include <objects/seq/Seq_annot.hpp>
#include <objects/seq/Seq_descr.hpp>
#include <objects/seq/Seq_ext.hpp>
#include <objects/seq/Seq_hist.hpp>
#include <objects/seq/Seq_inst.hpp>
#include <objects/seq/Seq_literal.hpp>
#include <objects/seq/Seqdesc.hpp>
#include <objects/seq/seqport_util.hpp>
#include <objects/seqalign/Dense_seg.hpp>
#include <objects/seqalign/Seq_align.hpp>
#include <objects/seqloc/Seq_id.hpp>
#include <objects/seqloc/Seq_interval.hpp>
#include <objects/seqloc/Seq_loc.hpp>
#include <objects/seqloc/Seq_loc_mix.hpp>
#include <objects/seqloc/Seq_point.hpp>
#include <objects/seqset/Bioseq_set.hpp>
#include <objects/seqset/Seq_entry.hpp>
#include <objtools/readers/message_listener.hpp>
#include <objtools/readers/line_error.hpp>
#include <objtools/seqmasks_io/mask_reader.hpp>
#include <objtools/seqmasks_io/mask_writer.hpp>
#include <objtools/seqmasks_io/mask_fasta_reader.hpp>
#include <objmgr/util/sequence.hpp>
#include <ctype.h>

USING_NCBI_SCOPE;
USING_SCOPE(objects);

/**
 * FastaFilterSize
 *    
 *  Motivation:
 *      filter sequences on size
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class FastaFilterSize : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        FastaFilterSize()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "A program filtering fasta sequences on size"
                );
            /* argument for input file */
            arg_desc->AddDefaultKey(
                    "i",/* name */
                    "input_file_name",/* synopsis */
                    "input file name",/* comment */
                    CArgDescriptions::eInputFile, /* argument type */
                    "-" /* default value*/
                    );
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* optional argument for min size */
            arg_desc->AddOptionalKey(
                    "m",
                    "min_size",
                    "min size inclusive",
                    CArgDescriptions::eInteger
                    );
            /* optional argument for max size */
            arg_desc->AddOptionalKey(
                    "M",
                    "max_size",
                    "max size inclusive",
                    CArgDescriptions::eInteger
                    );
            
            /* add simple flag to inverse selection */
            arg_desc->AddFlag("v", 
                "inverse selection",
                true
                );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~FastaFilterSize()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            /* create a FASTA writer */
            CFastaOstream* writer = new CFastaOstream(out);
            /* shall we inverse the selection ? */
            bool invert=this->GetArgs()["v"];

            /* get input stream */
            CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
            /* init min and max inclusive */
            TSeqPos min_inclusive=0;
            TSeqPos max_inclusive= std::numeric_limits<TSeqPos>::max();
            /* if min and max set by user */
            if (this->GetArgs()["m"])
                {
                min_inclusive = this->GetArgs()["m"].AsInteger();
                }
            if (this->GetArgs()["M"])
                {
                max_inclusive = this->GetArgs()["M"].AsInteger();
                }
            /* create a FASTA parser */
            CFastaReader reader(input_stream ,
                    CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
                    );
            /* while we can read something */
            while( !reader.AtEOF() )
                {
                /* read next sequence */
                CRef<CSeq_entry> seq(reader.ReadOneSeq());
                
                /* should we print the sequence ?*/
                if((seq->GetSeq().GetLength()>= min_inclusive && 
                    seq->GetSeq().GetLength()<= max_inclusive
                    )!= invert)
                    {
                    /* yes, print */
                    writer->Write(*seq);
                    }
                }
            /* cleanup */
            delete writer;
            out.flush();
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return FastaFilterSize().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "filterfasta" /* Specify application name */
        );
    }

Compiling

g++  -std=gnu++11  -finline-functions -fstrict-aliasing -fomit-frame-pointer    
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/include 
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseM
T/inc  -Wno-deprecated -Wno-deprecated-declarations -o filtersize filtersize.cpp
 -L/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/Release
MT/lib -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static -lncbi
_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static -ltaxon1
-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static -lblastxml-s
tatic -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static -lxalgobla
stdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static -lxalgow
inmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static -lxobj
util-static -lxobjread-static -lvariation-static -lcreaders-static -lsubmit-stat
ic -lxnetblastcli-static -lxnetblast-static -lblastdb-static -lscoremat-static -
ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static -lncbi_xreader_id1
-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static -lxconnext-static 
-lxconnect-static -ldbapi_driver-static -lncbi_xreader-static -lxconnext-static 
-lxconnect-static -lid1-static -lid2-static -lseqsplit-static -lxcompress-static
 -lxobjmgr-static -lgenome_collection-static -lseqedit-static -lseqset-static -l
seq-static -lseqcode-static -lsequtil-static -lpub-static -lmedline-static -lbib
lio-static -lgeneral-static -lxser-static -lxutil-static -lxncbi-static -lgomp -
lz -lbz2 -ldl -lnsl -lrt -lm -lpthread -L/commun/data/users/lindenb/src/blast2.2
.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib -lblast_app_util -lblastinput-stati
c -lncbi_xloader_blastdb-static -lncbi_xloader_blastdb_rmt-static -lxblastformat
-static -lalign_format-static -ltaxon1-static -lblastdb_format-static -lgene_inf
o-static -lxalnmgr-static -lblastxml-static -lblastxml2-static -lxcgi-static -lx
html-static -lxblast-static -lxalgoblastdbindex-static -lcomposition_adjustment-
static -lxalgodustmask-static -lxalgowinmask-static -lseqmasks_io-static -lseqdb
-static -lblast_services-static -lxobjutil-static -lxobjread-static -lvariation-
static -lcreaders-static -lsubmit-static -lxnetblastcli-static -lxnetblast-stati
c -lblastdb-static -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xlo
ader_genbank-static -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_x
reader_cache-static -lxconnext-static -lxconnect-static -ldbapi_driver-static -l
ncbi_xreader-static -lxconnext-static -lxconnect-static -lid1-static -lid2-stati
c -lseqsplit-static -lxcompress-static -lxobjmgr-static -lgenome_collection-stat
ic -lseqedit-static -lseqset-static -lseq-static -lseqcode-static -lsequtil-stat
ic -lpub-static -lmedline-static -lbiblio-static -lgeneral-static -lxser-static 
-lxutil-static -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread

Testing

get help

$ ./filtersize  -help
USAGE
  filtersize [-h] [-help] [-xmlhelp] [-i input_file_name]
    [-o output_file_name] [-m min_size] [-M max_size] [-v]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   A program filtering fasta sequences on size

OPTIONAL ARGUMENTS
 (...)
 -i <File_In>
   input file name
   Default = `-'
 -o <File_Out>
   output file name
   Default = `-'
 -m <Integer>
   min size inclusive
 -M <Integer>
   max size inclusive
 -v
   inverse selection
(...)

test with some fasta sequences:

echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20

>lcl|1 seq1
AATCGTACGCTAG
echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20 -v 

>lcl|2 seq2
AATCGTACGATGCAAAAAAAAAAAATCTAG
>lcl|3 seq3
AATC
>lcl|4 seq3
AATC

That's it,
Pierre

26 September 2013

Presentation: Advanced NCBI

Here is my presentation for the course "Advanced NCBI", I gave yesterday at the University of Nantes.


11 February 2013

Counting the reads in a BAM file using SGE and the Open-MPI library: my notebook.

In the current post, I'll describe my first Open MPI program. Open MPI is "a Message Passing Interface (MPI) library, a standardized and portable message-passing system to function on a wide variety of parallel computers". My C program takes a list of BAMs, distributes some jobs on the SGE (SUN/Oracle Grid Engine) to count the number of reads and returns the results to a master process.

Downloading and installing

The library was downloaded from http://www.open-mpi.org/software/ompi/v1.6/, compiled with the option --with-sge and installed in '/commun/data/packages/openmpi'.
./configure --prefix=/commun/data/packages/openmpi --with-sge
make 
make install

Configuring SGE to run MPI-based program

As described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html .
$ su
$ source /commun/sge_root/bird/common/settings.csh
$ setenv ARCH lx24-amd64
$ qconf -ap orte
And added:
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
Add orte to the parallele env list:
qconf -mattr queue pe_list "orte" all.q
Using qconf -mconf I've also set shell_start_mode to 'unix_behavior'.

Algorithm

In the 'main' method, test if this is the master(==0) or a slave process(!=0):
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
 {
 master(argc-1,&argv[1]);
 }
else
 {
 slave(argc-1,&argv[1]);
 }

MASTER

The master gets the number of available proc:
MPI_Comm_size(MPI_COMM_WORLD, &proc_count)
Loop over the number of available processors and the BAMS, for each BAM, create a fixed size structure containing the path to the BAM as well as the count of reads initialized to '0':
typedef struct Params_t
 {
 char filename[FILENAME_MAX];
 int is_error;
 long count_total;
 long count_mapped;
 long count_properly_paired;
 long count_dup;
 }Params;
This structure is sent to the slaves:
 MPI_Send(
 &param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_COUNT_BAM, /* user chosen message tag */
 MPI_COMM_WORLD /* default communicator */
 ); 
and we wait for the slaves to return those 'Params' with the filled values.:
MPI_Recv(&param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 MPI_ANY_SOURCE, /* receive from any sender */
 MPI_ANY_TAG, /* any type of message */
 MPI_COMM_WORLD, /* default communicator */
 &status);
And the result is printed:
printf("%s\t%ld\t%ld\t%ld\t%ld\n",
 param.filename,
 param.count_total,
 param.count_mapped,
 param.count_properly_paired,
 param.count_dup
 );
At the end of the master method, when no more BAM is available, the slaves are released (tag is 'TAG_RELEASE_SLAVE') with :
MPI_Send(
 &empty, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_RELEASE_SLAVE, /* SHutdown */
 MPI_COMM_WORLD
 ); /* default communicator */

SLAVE

The slave method receives some messages from the master:
MPI_Recv(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 MPI_ANY_TAG,
 MPI_COMM_WORLD,
 &status);
If the message was 'TAG_RELEASE_SLAVE' we exit the method. Else the BAM file is scanned using the SAmtools API for C.
count_reads(&param);
and the result is sent back to the master:
MPI_Send(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 0,
 MPI_COMM_WORLD
 );
.

Running the Job on SGE

I still don't understand why I need to write the following line:
# qsh -pe orte 4
Your job 84581 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 84581 has been successfully scheduled.
Run the analysis:
qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun \
 /path/to/ompibam \
 `find /commun/data/projects/folder1234/align -name "*.bam"`
qstat -f shows the jobs running:
arch          states
   ---------------------------------------------------------------------------------
   all.q@node01                   BIP   0/15/64        2.76 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node02                   BIP   0/14/64        3.89 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node03                   BIP   0/14/64        3.23 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node04                   BIP   0/14/64        3.68 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node05                   BIP   0/15/64        2.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node06                   BIP   0/14/64        3.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node07                   BIP   0/14/64        3.79 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14 
output:
#filename total mapped properly_paired dup
/commun/data/projects/folder1234/11X0589_markdup.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_pair1_sorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_realigned.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_recal.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0602_pair1_sorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_realigned.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_markdup.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_pair1_unsorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0375_pair1_sorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_realigned.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_markdup.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_pair1_sorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0367_markdup.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0375_pair1_unsorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_realigned.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0291_pair1_sorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_markdup.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_realigned.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_markdup.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0311_realigned.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0375_recal.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0589_pair1_unsorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0367_pair1_unsorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0290_pair1_sorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_realigned.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0291_pair1_unsorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_pair1_sorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0240_pair1_sorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_realigned.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_markdup.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_sorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_markdup.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_realigned.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0602_recal.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0290_markdup.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_pair1_unsorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0247_pair1_sorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_pair1_sorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0456_realigned.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_markdup.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0247_realigned.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_markdup.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0367_recal.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0305_pair1_sorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_markdup.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_realigned.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_pair1_sorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_recal.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0542_realigned.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0542_markdup.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0311_pair1_unsorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0291_recal.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/XX07551_pair1_sorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0240_pair1_unsorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_unsorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX07551_markdup.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/XX07551_realigned.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0305_pair1_unsorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0311_recal.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0456_pair1_unsorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_pair1_unsorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0542_pair1_unsorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_pair1_unsorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0433_pair1_sorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0433_markdup.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/XX10272_recal.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0433_realigned.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0240_recal.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0290_recal.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0456_recal.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_recal.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0433_pair1_unsorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0305_recal.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_recal.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0433_recal.bam 2178798 1800943 1249738 0
However I noticed seems that running this code on the cluster is slower that running it on the master node...

The C code

Note: the C code is compiled with ortecc, not gcc.


I'm still very new to this API and to those concepts. Feel free to suggest anything :-)
That's it,
Pierre

02 November 2012

Saving your tweets in a database using sqlite, rhino, scribe, javascript

In the current post, I 'll describe a simple method to save your tweets in a sqlite database using Mozilla Rhino.

Prerequisites

  • sqlite
  • Apache Rhino. I think it should be de-facto available when the java developer toolkit (JDK) is installed
  • Scribe, the simple OAuth library for Java . It also requires Apache codec

The config.js file

Open an account on https://dev.twitter.com/ and create an App to receive an API-key and an API-secret.
Create the following file 'config.js' filled with the correct parameters.

The javascript

The following javascript file opens a Oauth connection, retrieves the tweets and stores them into sqlite. I've commented the code, I hope it is clear enough.

Running the script using Rhino

scribe.libs=/path/to/scribe-1.3.2.jar:/path/to/commons-codec.jar
rhino.libs=/usr/share/java/js.jar:/usr/share/java/jline.jar
sqlite.libs=/path/to/sqlitejdbc-v056.jar
CLASSPATH=${rhino.libs}:${scribe.libs}:${sqlite.libs}

java -cp ${CLASSPATH} org.mozilla.javascript.tools.shell.Main -f twitter2sqlite.js
At the first time, the user is asked to authorize the application to use the twitter API

The script runs forever (Ctrl-C to break), listening to the new tweets.

As a test, I wrote the following tweet:


... and the tweet was later inserted in the database...

Sleep...

Inserted ({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This is a test . ( #rhino, #jdbc, #sqlite, #scribe #javascript )", source:"web", truncated:false, in_reply_to_status_id:null, in_reply_to_status_id_str:null, in_reply_to_user_id:null, in_reply_to_user_id_str:null, in_reply_to_screen_name:null, geo:null, coordinates:null, place:null, contributors:null, retweet_count:0, entities:{hashtags:[{text:"rhino", indices:[51, 57]}, {text:"jdbc", indices:[59, 64]}, {text:"sqlite", indices:[66, 73]}, {text:"scribe", indices:[75, 82]}, {text:"javascript", indices:[83, 94]}], urls:[], user_mentions:[]}, favorited:false, retweeted:false})

Sleep...
Sleep...
Sleep...

Later, the tweets can be extracted using the sqlite command line:

$  sqlite3 tweets.sqlite 'select * from tweet'

264464160664981504|({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This
264421310841638913|({created_at:"Fri Nov 02 17:38:47 +0000 2012", id:264421310841638900, id_str:"264421310841638913", text:"The tools for recalibration have cha
264264932097400832|({created_at:"Fri Nov 02 07:17:24 +0000 2012", id:264264932097400830, id_str:"264264932097400832", text:"@warandpeace you're welcome. Your sh
264158323287416832|({created_at:"Fri Nov 02 00:13:46 +0000 2012", id:264158323287416830, id_str:"264158323287416832", text:"Drawing of the day November 1, 2012.
264142732174438400|({created_at:"Thu Nov 01 23:11:49 +0000 2012", id:264142732174438400, id_str:"264142732174438400", text:"[delicious] PLOS Collections: How th
264064117558624256|({created_at:"Thu Nov 01 17:59:26 +0000 2012", id:264064117558624260, id_str:"264064117558624256", text:"I've added a stupid basic dependency
264025607724204034|({created_at:"Thu Nov 01 15:26:24 +0000 2012", id:264025607724204030, id_str:"264025607724204034", text:"in the desert lab, checking my on-go
264013563704795136|({created_at:"Thu Nov 01 14:38:33 +0000 2012", id:264013563704795140, id_str:"264013563704795136", text:"Drawing of the day November 1, 2012.
263996436679630848|({created_at:"Thu Nov 01 13:30:29 +0000 2012", id:263996436679630850, id_str:"263996436679630848", text:"RT @RealistComics: he's tall, dark a
263966759210590208|({created_at:"Thu Nov 01 11:32:34 +0000 2012", id:263966759210590200, id_str:"263966759210590208", text:"RT @guermonprez: #Aubry Un avion nor
263946369847398402|({created_at:"Thu Nov 01 10:11:33 +0000 2012", id:263946369847398400, id_str:"263946369847398402", text:"[delicious] OVal: object validation 
263946366919790593|({created_at:"Thu Nov 01 10:11:32 +0000 2012", id:263946366919790600, id_str:"263946366919790593", text:"[delicious] MyBatis #tweet: a first 
263941020729896960|({created_at:"Thu Nov 01 09:50:17 +0000 2012", id:263941020729896960, id_str:"263941020729896960", text:"RT @josh_wills: I have never been pr
263938670187388928|({created_at:"Thu Nov 01 09:40:57 +0000 2012", id:263938670187388930, id_str:"263938670187388928", text:"RT @softmodeling @peterneubauer: Usi
263936362716200960|({created_at:"Thu Nov 01 09:31:47 +0000 2012", id:263936362716200960, id_str:"263936362716200960", text:"declined to review an article about 
263934528186351616|({created_at:"Thu Nov 01 09:24:29 +0000 2012", id:263934528186351600, id_str:"263934528186351616", text:"@figshare Thanks, ( was http://t.co/
263815846139412480|({created_at:"Thu Nov 01 01:32:53 +0000 2012", id:263815846139412480, id_str:"263815846139412480", text:"Drawing of the day October 30, 2012.
263731855919026176|({created_at:"Wed Oct 31 19:59:09 +0000 2012", id:263731855919026180, id_str:"263731855919026176", text:"[delicious] An integrated map of gen
263726281647067136|({created_at:"Wed Oct 31 19:36:59 +0000 2012", id:263726281647067140, id_str:"263726281647067136", text:"RT @bryan_howie: 1000 Genomes paper 
263695076516052992|({created_at:"Wed Oct 31 17:33:00 +0000 2012", id:263695076516053000, id_str:"263695076516052992", text:"\"Forget your Past\" ( abandoned Bul

That's it
Pierre

27 September 2012

Playing with the #Ensembl REST API: filtering a VCF with javascript

The new Ensembl REST API was announced today: "We are pleased to announce the beta release of our programming language agnostic REST API, for Release 68 data, at http://beta.rest.ensembl. Our initial release provides access to:

  • Sequences (genomic, cDNA, CDS and protein)
  • VEP (Variant Effect Predictor)
  • Homologies
  • Gene Trees
  • Assembly and coordinate mapping."

In the current post I will filter a VCF file with this API and javascript.

The VCF

Our initial file is the following VCF:
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . PASS DP=65
1 69427 rs148502021 T A . PASS DP=1557
1 69452 rs142004627 A G . PASS DP=155
1 69475 rs148502021 C T . PASS DP=231
1 865583 rs148711625 A G . PASS DP=231
1 866460 rs148884928 A G . PASS DP=23
1 866461 . G A . PASS DP=24

The javascript

The VCF will be read on the standard input using the following script and the Rhino JS engine. The script reads the VCF and for each substitution, it calls the Ensembl REST API, parses the JSON response and return the transcript identifier if the mutation is a missense_variant or a polyphen_prediction="probably damaging":
importPackage(java.io);
importPackage(java.lang);



function sleep(milliseconds)
  {
  /* hacked from http://www.phpied.com/sleep-in-javascript/ */
  var start = new Date().getTime();
  for (var i = 0; i < 1e7; i++) {
    if ((new Date().getTime() - start) > milliseconds){
      break;
      }
    }
  }


function damagingTranscript(json)
 {
 for(var d in json.data)
  {
  var transcripts=json.data[d].transcripts;
 
  if(!transcripts) return null;
  for(var t in transcripts)
   {
   var transcript=transcripts[t];
   for(var a in transcript.alleles)
    {
    var allele=transcript.alleles[a];
    if(allele.polyphen_prediction=="probably damaging" ||
       allele.consequence_terms.indexOf("missense_variant")!=-1 )
        {
        return transcript.transcript_id;
        }
    }
   }
  }
 return null;
 }

var baseRegex=new RegExp(/^[ATGCatgc]$/);



var stdin = new java.io.BufferedReader( new java.io.InputStreamReader(java.lang.System['in']) );
var line;
while((line=stdin.readLine())!=null)
 {
 if(line.startsWith("#"))
  {
  print(line); continue;
  }
 var tokens=line.split("\t");
 var chrom=tokens[0];
 var pos= parseInt(tokens[1]);
 var ref= tokens[3];
 var alt= tokens[4];
 if(!baseRegex.test(ref)) continue;
 if(!baseRegex.test(alt)) continue;
 
 sleep(200);
 var url="http://beta.rest.ensembl.org/vep/human/"+
  chrom+":"+pos+"-"+pos+"/"+alt+
  "/consequences?content-type=application/json";
 var jsonStr=readUrl(url,"UTF-8");
 var json=eval("("+jsonStr+")");
 var transcript=damagingTranscript(json);
 if(transcript==null) continue;
 tokens[7]+=(";TRANSCRIPT="+transcript);
 print(tokens.join('\t'));
 }

As an example, here is the response from the ENSEMBL server for 1:866460 A/G:

http://beta.rest.ensembl.org/vep/human/1:866460-866460/G/consequences?content-type=application/json
{
    "data": [
        {
            "location": {
                "coord_system": "chromosome",
                "name": "1",
                "strand": 1,
                "end": "866460",
                "start": "866460"
            },
            "hgvs": {
                "G": "1:g.866460C>G"
            },
            "transcripts": [
                {
                    "translation_stable_id": "ENSP00000411579",
                    "intron_number": null,
                    "cdna_end": 385,
                    "translation_end": 99,
                    "exon_number": "4/7",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000420190",
                    "cdna_start": 385,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000411579.1:p.Ala99Gly",
                            "sift_score": 0.04,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000420190.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000393181",
                    "intron_number": null,
                    "cdna_end": 356,
                    "translation_end": 99,
                    "exon_number": "4/5",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000437963",
                    "cdna_start": 356,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000393181.1:p.Ala99Gly",
                            "sift_score": 0.03,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000437963.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000342313",
                    "intron_number": null,
                    "cdna_end": 379,
                    "translation_end": 99,
                    "exon_number": "4/14",
                    "is_canonical": 1,
                    "transcript_id": "ENST00000342066",
                    "cdna_start": 379,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000342313.3:p.Ala99Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000342066.3:c.296C>G"
                        }
                    },
                    "translation_start": 99,
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "ccds": "CCDS2.2",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000349216",
                    "intron_number": null,
                    "cdna_end": 67,
                    "translation_end": 23,
                    "exon_number": "2/12",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000341065",
                    "cdna_start": 67,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 68,
                    "translation_start": 23,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.008,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000349216.4:p.Ala23Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000341065.4:c.67C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 68,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                }
            ]
        }
    ]
}

Invoking RHINO, filtering the VCF

$ cat input.vcf | rhino -f restensembl.js
  
##fileformat=VCFv4.0
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69427 rs148502021 T A . PASS DP=1557;TRANSCRIPT=ENST00000335137
1 69452 rs142004627 A G . PASS DP=155;TRANSCRIPT=ENST00000335137
1 69475 rs148502021 C T . PASS DP=231;TRANSCRIPT=ENST00000335137
1 865583 rs148711625 A G . PASS DP=231;TRANSCRIPT=ENST00000420190
1 866460 rs148884928 A G . PASS DP=23;TRANSCRIPT=ENST00000420190

Limitations for the Ensembl REST API:




That's it,

Pierre