Showing posts with label ncbi. Show all posts
Showing posts with label ncbi. Show all posts

09 September 2016

Playing with #magicblast, the #NCBI Short read mapper. My notebook

NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.



Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).

First, here is an extract of the manual for magicblast.
USAGE
DESCRIPTION
   Short read mapper

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
    * Incompatible with:  sra
 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
   Input format for sequences
   Default = `fasta'
    * Incompatible with:  sra
 -paired
   Input query sequences are paired
 -query_mate <File_In>
   FASTA file with mates for query sequences (if given in another file)
    * Requires:  query
 -sra <String>
   Comma-separated SRA accessions
    * Incompatible with:  query, infmt

 *** Formatting options
 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
   alignment view options:
   sam = SAM format,
   tabular = Tabular format,
   text ASN.1
   Default = `sam'

Indexing the reference for magicblast

I've indexed the human chr22 using the good old NCBI makeblastdb:
$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22

Mapping Paired-End reads with magicblast

First I've removed the read names' suffixes '/1' and '/2' because they still appear in the final bam.
gunzip -c R1.aa.fq.gz | sed 's/\/1$$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's/\/2$$//' | gzip > R2.fq.gz
The reads are then mapped with magicblast using the following command:
magicblast -db chr22 \
 -infmt fastq -paired \
 -query R1.fq.gz \
 -query_mate R2.fq.gz |\
 sed 's/gnl\|BL_ORD_ID\|0/chr22/g' |\
 samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM
As far as I could see, there was no option to specify the read group (RG).

Here,I've transformed the name of the contig because it looks like a blast-id identifier:
I've also mapped the reads using bwa:

bwa mem  chr22.fa R1.fq.gz R2.fq.gz |\
 samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM

BAM headers

Here is the BAM header for magicblast: Magic blast uses the spec v1.2 and has a Group-Order flag.

@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz

And the BAM header for bwa:
@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz

Samtools samflags

for magicblast:
24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)

for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Validationg with picard

Validation of child.magic.bam with picard generates many errors:
$ java -jar picard.jar  ValidateSamFile  I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)
while there is ~no error for child.bwa.bam:
$ java -jar picard.jar  ValidateSamFile  I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200



Variant Calling

Both BAMs are mapped with samtools/vcftools (min DP=10)
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.bwa.vcf -
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.magic.vcf -
Number of variants:
$ grep -v "#" child.magic.vcf  | wc -l
111
$ grep -v "#" child.bwa.vcf  | wc -l
166

Here is the diff for CHROM/POS/REF/ALT:
  • Uniq to BWA: 'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 64 variants
  • Uniq to Magic: 'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 9 variants
  • Common variants: 'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 102 variants

The Makefile

SHELL=/bin/bash
data.dir=${HOME}/src/DATA
ncbi.bin=${HOME}/packages/magicblast/bin
REF=chr22.fa
samtools.exe=${HOME}/packages/samtools/samtools 
bwa.exe=${HOME}/packages/bwa-0.7.15/bwa 

all: child.magic.bam child.bwa.bam

R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz 
 gunzip -c $< | sed 's/\/1$$//' | gzip > $@
R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz 
 gunzip -c $< | sed 's/\/2$$//' | gzip > $@

child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz 
 ${bwa.exe} mem  ${REF} $(word 2,$^) $(word 3,$^) |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 


child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz 
 ${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\
 sed 's/gnl\|BL_ORD_ID\|0/$(basename ${REF})/g' |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 

${REF}.bwt : ${REF}
 ${bwa.exe} index $<

${REF}.nin : ${REF}
 ${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF})

${REF} :
 wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz"
 gunzip -f $@.gz

That's it,
Pierre

21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "
I've recently discovered that pubmed now integrates ORCID identfiers.

And there are several minor problems, I found some articles where the ORCID id is malformed or where different people use the same ORCID-ID:







You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
(...)
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statement.
<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) < 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This stylesheet contains an extension 'strmcp' for the xslt processor xalan to compare two XML strings
This extension is just used to always be sure that the field "orcid1" in the table "collab" is always lower than "orcid2" to avoid duplicates pairs.
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)
and those sql statetements are loaded into sqlite3:
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

sqlite3 -separator ' ' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed  -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g"  -e 's/"/\"/g' |\
 awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"



The output is saved and then loaded into gephi.






That's it,

Pierre

02 February 2015

Listing the 'Subject' Sequences in a BLAST database using the NCBI C++ toolbox. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/filtering-fasta-sequences-using-ncbi-c.html) I've built an application filtering FASTA sequences using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm gonna write a tool listing the 'subject' sequences in a BLAST database.

This new application ListBlastDatabaseContent takes only one argument '-db', the name of the blast database :

void ListBlastDatabaseContent::Init()
    {
    (...)
    /* argument for name */
    arg_desc->AddDefaultKey(
            "db",/* name */
            "database",/* synopsis */
            "Blast database",/* comment */
            CArgDescriptions::eString, /* argument type */
            "" /* default value*/
            );
    (...)
    }   

in ListBlastDatabaseContent::Run() we get the name of the database

(...)
string database_name(  this->GetArgs()["db"].AsString() );
(...)

and we open a new CSearchDatabase object ("Blast Search Subject")

CSearchDatabase* searchDatabase = new CSearchDatabase(
    database_name, /* db name */
    CSearchDatabase::eBlastDbIsNucleotide /* db type */
    );

We get a handle to the seqdb, "it defines access to the database component by calling methods on objects which represent the various database files, such as the index, header, sequence, and alias files".

CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();

We get an iterator to the seqdb database http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html:

CSeqDBIter  iter = seqdb->Begin();

While this iterator is valid, we retrieve and print the DNA sequence and the header associated to the iterator.

while(iter)
    {
    string output;
    TSeqRange range; /* entire sequence */
    seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
    CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
    CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
    while(r!= def_line->Get().end())
        {
        cout << ">" << (*r)->GetTitle()  << endl;
        cout << output << endl;
        ++r;
        }
    ++iter;
    }

All in one:

#include <memory>
#include <limits>
#include "corelib/ncbiapp.hpp"
#include <algo/blast/api/blast_types.hpp>
#include <algo/blast/api/sseqloc.hpp>
#include <algo/blast/api/blast_aux.hpp>
#include <algo/blast/api/blast_options_handle.hpp>
#include <algo/blast/api/blast_results.hpp>
#include <algo/blast/api/local_blast.hpp>
#include <ctype.h>



USING_NCBI_SCOPE;
using namespace blast;

/**
 *  Name
 *      ListBlastDatabaseContent
 *  Motivation:
 *      Print content of blast database
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class ListBlastDatabaseContent : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        ListBlastDatabaseContent()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                __FILE__
                );
            
         
            /* argument for name */
           arg_desc->AddDefaultKey(
                    "db",/* name */
                    "database",/* synopsis */
                    "Blast database",/* comment */
                    CArgDescriptions::eString, /* argument type */
                    "" /* default value*/
                    );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~ListBlastDatabaseContent()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            string database_name(  this->GetArgs()["db"].AsString() );
            /* Blast Search Subject.  http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSearchDatabase.html#details */
            CSearchDatabase* searchDatabase = 0;
            
        
            /* initialize search database */
            searchDatabase = new CSearchDatabase(
                database_name, /* db name */
                CSearchDatabase::eBlastDbIsNucleotide /* db type */
                );
            
            /* This class provides the top-level interface class for BLAST database users. It 
            defines access to the database component by calling methods on objects which represent
            the various database files, such as the index, header, sequence, and alias files. 
            http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
            CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();                 
            
            /* Small class to iterate over a seqdb database. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html#details */
            CSeqDBIter  iter = seqdb->Begin();
            
            while(iter)
                {
                /* we put the sequence here */
                 string output;
                 /* retrieve the entire sequence */
                 TSeqRange range;
                /* This method gets the sequence data for an OID, converts it to a human-readable encoding (either Iupacaa for protein, or Iupacna for nucleotide), and returns it in a string. This is equivalent to calling the three-argument versions of this method with those encodings. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
                seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
                /* Represents ASN.1 type Blast-def-line-set defined in file blastdb.asn http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCBlast__def__line__set.html */
                CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
                CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
                while(r!= def_line->Get().end())
                    {
                    cout << ">" << (*r)->GetTitle()  << endl;
                    cout << output << endl;
                    ++r;
                    }

                ++iter;
                }
            if( searchDatabase != 0) delete searchDatabase;
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return ListBlastDatabaseContent().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "loadblastdb" /* Specify application name */
        );
    }

Testing

get some fasta sequences, remove the headers and create a blast database.

$ curl  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=281381068,281381067,281381066,281381065&rettype=fasta" |\
        awk '/^>/ {printf(">%d\n",NR);next;} {print;}' > sequences.fa

$ makeblastdb -in sequences.fa -dbtype nucl

run the program:

$ loadblastdb -db sequences.fa
>1
GATCGCGGGTATCTTCGATGTAGGGCGAGTCGCGTCACCCTGACCAACCCCTCGAGGTGGTCCTCGTGGAAGGAGGAAGCGCAGTCGACGAAGCGGAAGATGTGGCTGAGCCATCTCGTCGTGTCGGCCGCGAGACGACCCGCGAGTTTCTGCGCGTGTGACTTTGGTAGCGCTACTGCGTAAGCCGAGA
>6
CAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTC
>10
ACCGAAACGGCAAAATTGGCCTTAGTCACGCACACTATCTCGGCTTACGCAGTAGCGCTACCAAAGTCACACGCGCAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTCGCCCTACATCGAAGATACCCGCGATC
>16
GATCGAACGATAAACTTCTGCTATCGGTTCCTCAGCAATCAGTATAACGAGCGGCACATGTATTGCGGCATATTGCAAAAAACATACCTATAAAGAAGGCTAAACTGATTGATTCAAATTAACTTGAGAATACACGATGTTGATGTTCAGTCGAAACAAATCGCTCGCGTATTGTGATC

That's it,
Pierre

30 January 2015

Filtering Fasta Sequences using the #NCBI C++ API. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I've built a simple "Hello World" application using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm gonna to extend the code in order to create a program filtering FASTA sequences on their sizes.

This new application FastaFilterSize needs three new arguments:

  • '-m' min size of the fasta sequence
  • '-M' max size of the fasta sequence
  • '-v' inverse selection

the method FastaFilterSize::Init() is used to register those optional arguments

/* optional argument for min size */
arg_desc->AddOptionalKey(
    "m",
    "min_size",
    "min size inclusive",
    CArgDescriptions::eInteger
    );
/* optional argument for max size */
arg_desc->AddOptionalKey(
    "M",
    "max_size",
    "max size inclusive",
    CArgDescriptions::eInteger
    );
/* add simple flag to inverse selection */
arg_desc->AddFlag("v", 
    "inverse selection",
    true
    );

the FastaFilterSize::Run() method is the workhorse of the program. Here, we get the output stream that will be used by a CFastaOstream to print the filtered Fasta sequences:

(...)
/* get output stream */
CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
/* create a FASTA writer */
CFastaOstream* writer = new CFastaOstream(out);
(...)

We get the input stream that will be used by a CFastaReader to read the FASTA sequences to be filtered.

/* get input stream */
CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
CFastaReader reader(input_stream ,
        CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
        );

We get the optional values for min-length and max-length:

if (this->GetArgs()["m"])
    {
    min_inclusive = this->GetArgs()["m"].AsInteger();
    }
if (this->GetArgs()["M"])
    {
    max_inclusive = this->GetArgs()["M"].AsInteger();
    }

We get the 'inverse' flag.

bool invert=this->GetArgs()["v"];

While a Fasta sequence can be read, we check its size and we print it if needed:

while( !reader.AtEOF() )
    {
    /* read next sequence */
    CRef<CSeq_entry> seq(reader.ReadOneSeq());
    /* should we print the sequence ?*/
    if((seq->GetSeq().GetLength()>= min_inclusive && 
        seq->GetSeq().GetLength()<= max_inclusive
        )!= invert)
        {
        /* yes, print */
        writer->Write(*seq);
        }
    }

All in one:

#include <memory>
#include <limits>
#include <objtools/readers/fasta.hpp>
#include "corelib/ncbiapp.hpp"
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Delta_ext.hpp>
#include <objects/seq/Delta_seq.hpp>
#include <objects/seq/NCBIeaa.hpp>
#include <objects/seq/IUPACaa.hpp>
#include <objects/seq/IUPACna.hpp>
#include <objects/seq/Seg_ext.hpp>
#include <objects/seq/Seq_annot.hpp>
#include <objects/seq/Seq_descr.hpp>
#include <objects/seq/Seq_ext.hpp>
#include <objects/seq/Seq_hist.hpp>
#include <objects/seq/Seq_inst.hpp>
#include <objects/seq/Seq_literal.hpp>
#include <objects/seq/Seqdesc.hpp>
#include <objects/seq/seqport_util.hpp>
#include <objects/seqalign/Dense_seg.hpp>
#include <objects/seqalign/Seq_align.hpp>
#include <objects/seqloc/Seq_id.hpp>
#include <objects/seqloc/Seq_interval.hpp>
#include <objects/seqloc/Seq_loc.hpp>
#include <objects/seqloc/Seq_loc_mix.hpp>
#include <objects/seqloc/Seq_point.hpp>
#include <objects/seqset/Bioseq_set.hpp>
#include <objects/seqset/Seq_entry.hpp>
#include <objtools/readers/message_listener.hpp>
#include <objtools/readers/line_error.hpp>
#include <objtools/seqmasks_io/mask_reader.hpp>
#include <objtools/seqmasks_io/mask_writer.hpp>
#include <objtools/seqmasks_io/mask_fasta_reader.hpp>
#include <objmgr/util/sequence.hpp>
#include <ctype.h>

USING_NCBI_SCOPE;
USING_SCOPE(objects);

/**
 * FastaFilterSize
 *    
 *  Motivation:
 *      filter sequences on size
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class FastaFilterSize : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        FastaFilterSize()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "A program filtering fasta sequences on size"
                );
            /* argument for input file */
            arg_desc->AddDefaultKey(
                    "i",/* name */
                    "input_file_name",/* synopsis */
                    "input file name",/* comment */
                    CArgDescriptions::eInputFile, /* argument type */
                    "-" /* default value*/
                    );
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* optional argument for min size */
            arg_desc->AddOptionalKey(
                    "m",
                    "min_size",
                    "min size inclusive",
                    CArgDescriptions::eInteger
                    );
            /* optional argument for max size */
            arg_desc->AddOptionalKey(
                    "M",
                    "max_size",
                    "max size inclusive",
                    CArgDescriptions::eInteger
                    );
            
            /* add simple flag to inverse selection */
            arg_desc->AddFlag("v", 
                "inverse selection",
                true
                );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~FastaFilterSize()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            /* create a FASTA writer */
            CFastaOstream* writer = new CFastaOstream(out);
            /* shall we inverse the selection ? */
            bool invert=this->GetArgs()["v"];

            /* get input stream */
            CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
            /* init min and max inclusive */
            TSeqPos min_inclusive=0;
            TSeqPos max_inclusive= std::numeric_limits<TSeqPos>::max();
            /* if min and max set by user */
            if (this->GetArgs()["m"])
                {
                min_inclusive = this->GetArgs()["m"].AsInteger();
                }
            if (this->GetArgs()["M"])
                {
                max_inclusive = this->GetArgs()["M"].AsInteger();
                }
            /* create a FASTA parser */
            CFastaReader reader(input_stream ,
                    CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
                    );
            /* while we can read something */
            while( !reader.AtEOF() )
                {
                /* read next sequence */
                CRef<CSeq_entry> seq(reader.ReadOneSeq());
                
                /* should we print the sequence ?*/
                if((seq->GetSeq().GetLength()>= min_inclusive && 
                    seq->GetSeq().GetLength()<= max_inclusive
                    )!= invert)
                    {
                    /* yes, print */
                    writer->Write(*seq);
                    }
                }
            /* cleanup */
            delete writer;
            out.flush();
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return FastaFilterSize().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "filterfasta" /* Specify application name */
        );
    }

Compiling

g++  -std=gnu++11  -finline-functions -fstrict-aliasing -fomit-frame-pointer    
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/include 
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseM
T/inc  -Wno-deprecated -Wno-deprecated-declarations -o filtersize filtersize.cpp
 -L/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/Release
MT/lib -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static -lncbi
_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static -ltaxon1
-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static -lblastxml-s
tatic -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static -lxalgobla
stdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static -lxalgow
inmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static -lxobj
util-static -lxobjread-static -lvariation-static -lcreaders-static -lsubmit-stat
ic -lxnetblastcli-static -lxnetblast-static -lblastdb-static -lscoremat-static -
ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static -lncbi_xreader_id1
-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static -lxconnext-static 
-lxconnect-static -ldbapi_driver-static -lncbi_xreader-static -lxconnext-static 
-lxconnect-static -lid1-static -lid2-static -lseqsplit-static -lxcompress-static
 -lxobjmgr-static -lgenome_collection-static -lseqedit-static -lseqset-static -l
seq-static -lseqcode-static -lsequtil-static -lpub-static -lmedline-static -lbib
lio-static -lgeneral-static -lxser-static -lxutil-static -lxncbi-static -lgomp -
lz -lbz2 -ldl -lnsl -lrt -lm -lpthread -L/commun/data/users/lindenb/src/blast2.2
.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib -lblast_app_util -lblastinput-stati
c -lncbi_xloader_blastdb-static -lncbi_xloader_blastdb_rmt-static -lxblastformat
-static -lalign_format-static -ltaxon1-static -lblastdb_format-static -lgene_inf
o-static -lxalnmgr-static -lblastxml-static -lblastxml2-static -lxcgi-static -lx
html-static -lxblast-static -lxalgoblastdbindex-static -lcomposition_adjustment-
static -lxalgodustmask-static -lxalgowinmask-static -lseqmasks_io-static -lseqdb
-static -lblast_services-static -lxobjutil-static -lxobjread-static -lvariation-
static -lcreaders-static -lsubmit-static -lxnetblastcli-static -lxnetblast-stati
c -lblastdb-static -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xlo
ader_genbank-static -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_x
reader_cache-static -lxconnext-static -lxconnect-static -ldbapi_driver-static -l
ncbi_xreader-static -lxconnext-static -lxconnect-static -lid1-static -lid2-stati
c -lseqsplit-static -lxcompress-static -lxobjmgr-static -lgenome_collection-stat
ic -lseqedit-static -lseqset-static -lseq-static -lseqcode-static -lsequtil-stat
ic -lpub-static -lmedline-static -lbiblio-static -lgeneral-static -lxser-static 
-lxutil-static -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread

Testing

get help

$ ./filtersize  -help
USAGE
  filtersize [-h] [-help] [-xmlhelp] [-i input_file_name]
    [-o output_file_name] [-m min_size] [-M max_size] [-v]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   A program filtering fasta sequences on size

OPTIONAL ARGUMENTS
 (...)
 -i <File_In>
   input file name
   Default = `-'
 -o <File_Out>
   output file name
   Default = `-'
 -m <Integer>
   min size inclusive
 -M <Integer>
   max size inclusive
 -v
   inverse selection
(...)

test with some fasta sequences:

echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20

>lcl|1 seq1
AATCGTACGCTAG
echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20 -v 

>lcl|2 seq2
AATCGTACGATGCAAAAAAAAAAAATCTAG
>lcl|3 seq3
AATC
>lcl|4 seq3
AATC

That's it,
Pierre

29 January 2015

Compiling a C++ 'Hello world' program using the #NCBI C++ toolbox: my notebook.

This post is my notebook for compiling a simple C++ application using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/).

This application prints 'Hello world' and takes two arguments:

  • '-o' to specificiy the output filename (default is standard output)
  • '-n' to set the name to be printed (default: "Word !")

The code I used is the one containing in the distribution of blast 2.2.30

The program itself is a C++ Class, it extends the class CNcbiApplication

(...)
class HelloApp : public CNcbiApplication
    {
    (...)
    }

In the constructor, we just set the name and the version of the program:

HelloApp::HelloApp()
    {
    CRef<CVersion> version(new CVersion());
    version->SetVersionInfo(1, 0, 0);
    this->SetFullVersion(version);
    }

The destructor is void

HelloApp::~HelloApp()
    {
    }

an HelloApp::Init() method is used to register the command line arguments.

void HelloApp::Init()
    {
    CArgDescriptions* arg_desc = new CArgDescriptions ;
    arg_desc->SetUsageContext(
        GetArguments().GetProgramBasename(),
        "Hello NCBI"
        );
    (...)

in HelloApp::Init() we register the '-o' argument:

arg_desc->AddDefaultKey(
    "o",/* name */
    "output_file_name",/* synopsis */
    "output file name",/* comment */
    CArgDescriptions::eOutputFile, /* argument type */
    "-" /* default value*/
    );

and the '-n' argument:

arg_desc->AddDefaultKey(
    "n",/* name */
    "name",/* synopsis */
    "Name",/* comment */
    CArgDescriptions::eString, /* argument type */
    "World !" /* default value*/
    );

once the two arguments have been registered, the new command line is pushed in the application:

SetupArgDescriptions( arg_desc );

the HelloApp::Run() method is the workhorse of the program. Here, we get the output stream and we print the name.

int HelloApp::Run()
    {
    /* get output stream */
    CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
    out << "Hello " << this->GetArgs()["n"].AsString() << endl;
    return 0;
    }

All in one:

#include <memory>
#include <limits>
#include "corelib/ncbiapp.hpp"
#include <ctype.h>



USING_NCBI_SCOPE;

/**
 * Hello NCBI
 *      testing NCBI C++ toolbox
 *  Motivation:
 *      filter sequences on size
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class HelloApp : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        HelloApp()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "Hello NCBI"
                );
            
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* argument for name */
           arg_desc->AddDefaultKey(
                    "n",/* name */
                    "name",/* synopsis */
                    "Name",/* comment */
                    CArgDescriptions::eString, /* argument type */
                    "World !" /* default value*/
                    );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~HelloApp()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            out << "Hello " << this->GetArgs()["n"].AsString() << endl;
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return HelloApp().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "hello" /* Specify application name */
        );
    }

Compiling

there is a bunch of libaries available under ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib and it's hard to know which are the one required for linking. I put everything in my command line. All in one it looks like:

g++  -std=gnu++11  -finline-functions -fstrict-aliasing -fomit-frame-pointer    
-Iblast2.2.30/ncbi-blast-2.2.30+-src/c++/include -Iblast2.2.30/ncbi-blast-2.2.30
+-src/c++/ReleaseMT/inc  -Wno-deprecated -Wno-deprecated-declarations -o hello h
ello.cpp -Lblast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib -lblast_app_util
 -lblastinput-static -lncbi_xloader_blastdb-static -lncbi_xloader_blastdb_rmt-st
atic -lxblastformat-static -lalign_format-static -ltaxon1-static -lblastdb_forma
t-static -lgene_info-static -lxalnmgr-static -lblastxml-static -lblastxml2-stati
c -lxcgi-static -lxhtml-static -lxblast-static -lxalgoblastdbindex-static -lcomp
osition_adjustment-static -lxalgodustmask-static -lxalgowinmask-static -lseqmask
s_io-static -lseqdb-static -lblast_services-static -lxobjutil-static -lxobjread-
static -lvariation-static -lcreaders-static -lsubmit-static -lxnetblastcli-stati
c -lxnetblast-static -lblastdb-static -lscoremat-static -ltables-static -lxalnmg
r-static -lncbi_xloader_genbank-static -lncbi_xreader_id1-static -lncbi_xreader_
id2-static -lncbi_xreader_cache-static -lxconnext-static -lxconnect-static -ldba
pi_driver-static -lncbi_xreader-static -lxconnext-static -lxconnect-static -lid1
-static -lid2-static -lseqsplit-static -lxcompress-static -lxobjmgr-static -lgen
ome_collection-static -lseqedit-static -lseqset-static -lseq-static -lseqcode-st
atic -lsequtil-static -lpub-static -lmedline-static -lbiblio-static -lgeneral-st
atic -lxser-static -lxutil-static -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lr
t -lm -lpthread -Lblast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib -lblast_a
pp_util -lblastinput-static -lncbi_xloader_blastdb-static -lncbi_xloader_blastdb
_rmt-static -lxblastformat-static -lalign_format-static -ltaxon1-static -lblastd
b_format-static -lgene_info-static -lxalnmgr-static -lblastxml-static -lblastxml
2-static -lxcgi-static -lxhtml-static -lxblast-static -lxalgoblastdbindex-static
 -lcomposition_adjustment-static -lxalgodustmask-static -lxalgowinmask-static -l
seqmasks_io-static -lseqdb-static -lblast_services-static -lxobjutil-static -lxo
bjread-static -lvariation-static -lcreaders-static -lsubmit-static -lxnetblastcl
i-static -lxnetblast-static -lblastdb-static -lscoremat-static -ltables-static -
lxalnmgr-static -lncbi_xloader_genbank-static -lncbi_xreader_id1-static -lncbi_x
reader_id2-static -lncbi_xreader_cache-static -lxconnext-static -lxconnect-stati
c -ldbapi_driver-static -lncbi_xreader-static -lxconnext-static -lxconnect-stati
c -lid1-static -lid2-static -lseqsplit-static -lxcompress-static -lxobjmgr-stati
c -lgenome_collection-static -lseqedit-static -lseqset-static -lseq-static -lseq
code-static -lsequtil-static -lpub-static -lmedline-static -lbiblio-static -lgen
eral-static -lxser-static -lxutil-static -lxncbi-static -lgomp -lz -lbz2 -ldl -l
nsl -lrt -lm -lpthread

Testing

run without arguments:

$ ./hello 
Hello World !

run with arguments:

$ ./hello -n Pierre
Hello Pierre
$ ./hello -n Pierre -o test.txt && cat test.txt 
Hello Pierre

errors:

$ ./hello -x Pierre
USAGE
  hello [-h] [-help] [-xmlhelp] [-o output_file_name] [-n name]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   Hello NCBI

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Unknown argument: "x"

get help:

$ ./hello -help
USAGE
  hello [-h] [-help] [-xmlhelp] [-o output_file_name] [-n name]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   Hello NCBI

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -xmlhelp
   Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other
   parameters
 -o <File_Out>
   output file name
   Default = `-'
 -n <String>
   Name
   Default = `World !'
 -logfile <File_Out>
   File to which the program log should be redirected
 -conffile <File_In>
   Program's configuration (registry) data file
 -version
   Print version number;  ignore other arguments
 -version-full
   Print extended version data;  ignore other arguments
 -dryrun
   Dry run the application: do nothing, only test all preconditions

get xml help:

$ ./hello -xmlhelp | xmllint --format -
<?xml version="1.0"?>
<ncbi_application xmlns="ncbi:application" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="ncbi:application ncbi_application.xsd">
  <program type="regular">
    <name>hello</name>
    <version>1.0.0</version>
    <description>Hello NCBI</description>
  </program>
  <arguments>
    <key name="conffile" type="File_In" optional="true">
      <description>Program's configuration (registry) data file</description>
      <synopsis>File_Name</synopsis>
    </key>
    <key name="logfile" type="File_Out" optional="true">
      <description>File to which the program log should be redirected</description>
      <synopsis>File_Name</synopsis>
    </key>
    <key name="n" type="String" optional="true">
      <description>Name</description>
      <synopsis>name</synopsis>
      <default>World !</default>
    </key>
    <key name="o" type="File_Out" optional="true">
      <description>output file name</description>
      <synopsis>output_file_name</synopsis>
      <default>-</default>
    </key>
    <flag name="dryrun" optional="true">
      <description>Dry run the application: do nothing, only test all preconditions</description>
    </flag>
    <flag name="h" optional="true">
      <description>Print USAGE and DESCRIPTION;  ignore all other parameters</description>
    </flag>
    <flag name="help" optional="true">
      <description>Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters</description>
    </flag>
    <flag name="version" optional="true">
      <description>Print version number;  ignore other arguments</description>
    </flag>
    <flag name="version-full" optional="true">
      <description>Print extended version data;  ignore other arguments</description>
    </flag>
    <flag name="xmlhelp" optional="true">
      <description>Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other parameters</description>
    </flag>
  </arguments>
</ncbi_application>

That's it,
Pierre

12 May 2014

Generating wikipedia semantic links from a pubmed-id

In "Building a biomedical semantic network in Wikipedia with Semantic Wiki Links" (Database . 2012 Mar 20;2012) Benjamin Good & al. introduced the Semantic Wiki Link (SWL):

An SWL is a hyperlink on Wikipedia that allows the editor to explicitly specify the type of relationship between the concept described on the page being edited and the concept that is being linked to (http://en.wikipedia.org/wiki/Template:SWL). These SWLs are implemented using MediaWiki templates.
(...)
any programmer can now write computer programs to parse Wikipedia content for SWLs and import them into third-party tools (e.g. triplestores, etc.)
Example: Phospholamban:
The protein encoded by this gene is found as a pentamer and is a major substrate for the cAMP-dependent protein kinase ({{SWL|type=substrate_for|target=protein kinase A|label=PKA}}) in cardiac muscle.




Using Entrez-Ajax (Loman & al.) and the Wikipedia API, I wrote a HTML+JS interface to accelerate the creation of a semantic SWL wiki-text from a PUBMED-id:


and.. well, that's it,

Pierre

25 October 2013

YES: "Choice of transcripts and software has a large effect on variant annotation"

This post was inspired by Aaron Quinlan's tweet:


Here is an example of a missense mutation found with VCFPredictions, a simple tool I wrote for variant effect prediction.

#CHROM POS ID ALT REF 
1 23710805 rs605468 A G
my tool uses the UCSC knownGene track, here is the context of the variant in the UCSC genome browser. There is one exon for TCEA3 (uc021oig.1) on the reverse strand.


If the base at 23710805 is changed from 'A'→'G' on the forward strand, there will be a non-synonymous variation Y(TAT)→H(CAT) on the reverse strand.


At the NCBI rs605468 is said to be " located in the intron region of NM_003196.1."

VEP cannot find this missense variation:

Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs605468 1:23710805 G - ENSR00001518296 Transcript regulatory_region_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000476978 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000450454 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204

(of course, my tool doesn't find some variations found in VEP too)

That's it,
Pierre

24 October 2013

PubMed Commons & Bioinformatics: a call for action


NCBI pubmed Commons/@PubMedCommons is a new system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others.
Now that we can add some comments to the papers in pubmed, I suggest to flag the articles to mark the deprecated softwares, databases, hyperlinks using a simple controlled syntax. Here are a few examples: the line starts with '!MOVED' or '!NOTFOUND' and is followed by a URL or/and a list of PMID or/and a quoted comment.

Examples

!MOVED: for http://www.ncbi.nlm.nih.gov/pubmed/8392714 (Rebase/1993) to http://www.ncbi.nlm.nih.gov/pubmed/19846593 (Rebase/2010)
!MOVED PMID:19846593 "A more recent version" 
In http://www.ncbi.nlm.nih.gov/pubmed/19919682 the URL has moved to http://biogps.org.
!MOVED <http://biogps.org> 
I moved the sources of http://www.ncbi.nlm.nih.gov/pubmed/9682060 to github
!MOVED <https://github.com/lindenb/cloneit/tree/master/c> 
!NOTFOUND: for http://www.ncbi.nlm.nih.gov/pubmed/9545460 ( Biocat EXCEL template ) url http://www.ebi.ac.uk/biocat/biocat.html returns a 404.
!NOTFOUND "The URL http://www.ebi.ac.uk/biocat/biocat.html was not found " 


That's it,

Pierre

26 September 2013

Presentation: Advanced NCBI

Here is my presentation for the course "Advanced NCBI", I gave yesterday at the University of Nantes.


18 July 2013

Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.

Let's annotate a VCF file with the data from the NCBI biosystem.

First the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi;biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene"


It produces a tabix-inded BED mapping the data of 'NCBI biosystem':

$ gunzip -c ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction

I wrote a tool named VCFBed: it takes as input a VCF and annotate it with the data of a tabix-ed BED. This tool is available on github at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz :
java -jar dist/vcfbed.jar \
   TABIXFILE=~/ncbibiosystem.bed.gz \
   TAG=NCBIBIOSYS \
   FMT='($4|$5|$6|$7)'|\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40

That's it,

Pierre

09 July 2013

Mapping the annotations of a query sequence on a BLAST hit, my notebook.

This post is the answer to my own question on biostar "BLASTN / TBLASTN : mapping the features of the query to the hit.". I wrote a java program to map the annotations of a sequence to the Hit of a Blast result. The tool is available on github at https://github.com/lindenb/jvarkit.
For example, say you want to map the features of the Uniprot record for Rotavirus NSP3 (http://www.uniprot.org/uniprot/P04514) on Genbank http://www.ncbi.nlm.nih.gov/nuccore/AY065842.1 (RNA).

Download P04514 as XML

TblastN P04514(protein) vs AY065842 (RNA) and save the BLAST result as XML:

<BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.28+</BlastOutput_version>
(...)
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|18139606|gb|AY065842.1|</Hit_id>
  <Hit_def>AY065842</Hit_def>
  <Hit_accession>AY065842</Hit_accession>
  <Hit_len>1078</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>546.584</Hsp_bit-score>
      <Hsp_score>1407</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>313</Hsp_query-to>
      <Hsp_hit-from>26</Hsp_hit-from>
      <Hsp_hit-to>964</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>2</Hsp_hit-frame>
      <Hsp_identity>260</Hsp_identity>
      <Hsp_positive>260</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>313</Hsp_align-len>
      <Hsp_qseq>MLKMESTQQMASSIINTSFEAAVVAATSTLELMGIQYDYNEIYTRVKSKFDYVMDDSGVKNNLLGKAATIDQALNGKFGSVMRNKNWMTDSRTVAKLDEDVNKLRMMLSSKGIDQKMRVLNACFSVKRIPGKSSSVIKCTRLMKDKIERGAVEVDDSFVEEKMEVDTVDWKSRYDQLERRFESLKQRVNEKYTTWVQKAKKVNENMYSLQNVISQQQNQIADLQNYCSKLEADLQNKVGSLVSSVEWYLKSMELPDEVKTDIEQQLNSIDTISPINAIDDLEILIRNLIHDYDRTFLMFKGLLRQCNYEYAYE</Hsp_qseq>
      <Hsp_hseq>MLKMESTQQMASSIINSSFEAAVVAATSTLELMGIQYDYNEVYTRVKSKFDLVMDDSGVKNNLIGKAITIDQALNGKFSSAIRNRNWMTDSRTVAKLDEDVNKLRIMLSSKGIDQKMRVLNACFSVKRIPGKSSSIVKCTRLMKDKLERGEVEVDDSFVEEKMEVDTIDWKSRYEQLEKRFESLKHRVNEKYNHWVLKARKVNENMNSLQNVISQQQAHINELQMYNNKLERDLQSKIGSVVSSIEWYLRSMELSDDVKSDIEQQLNSIDQLNPVNAIDDFESILRNLISDYDRLFIMFKGLLQQCNYTYTYE</Hsp_hseq>
      <Hsp_midline>MLKMESTQQMASSIIN SFEAAVVAATSTLELMGIQYDYNE YTRVKSKFD VMDDSGVKNNL GKA TIDQALNGKF S  RN NWMTDSRTVAKLDEDVNKLR MLSSKGIDQKMRVLNACFSVKRIPGKSSS  KCTRLMKDK ERG VEVDDSFVEEKMEVDT DWKSRY QLE RFESLK RVNEKY  WV KA KVNENM SLQNVISQQQ  I  LQ Y  KLE DLQ K GS VSS EWYL SMEL D VK DIEQQLNSID   P NAIDD E   RNLI DYDR F MFKGLL QCNY Y YE</Hsp_midline>
    </Hsp>
   (..)
   
Now generate a BED file to map the features of P04514 on AY065842:
$ java -jar dist/blastmapannots.jar I=P04514.xml B=blast.xml

AY065842 25 961 Non-structural_protein_3 943 + 25961 255,255,255 1 936 25
AY065842 34 469 RNA-binding 970 + 34 469 255,255,255 1 435 34
AY065842 472 640 Dimerization 947 + 472 640 255,255,255 1 168 472
AY065842 532 724 Interaction_with_ZC3H7B 917 + 532 724 255,255,255 1 192 532
AY065842 646 961 Interaction_with_EIF4G1 905 + 646 961 255,255,255 1 315 646
AY065842 520 733 coiled-coil_region 916 + 520 733 255,255,255 1 213 520

That's it,

Pierre

01 May 2013

Inserting the result of a BLAST into a Database using XSLT.

Here is the XML output of a BLAST:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.27+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Z
hang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>52385</BlastOutput_query-ID>
  <BlastOutput_query-def>myseq</BlastOutput_query-def>
  <BlastOutput_query-len>30</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>52385</Iteration_query-ID>
  <Iteration_query-def>myseq</Iteration_query-def>
  <Iteration_query-len>30</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|110624327|dbj|AK225891.1|</Hit_id>
  <Hit_def>Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02</Hit_def>
  <Hit_accession>AK225891</Hit_accession>
  <Hit_len>1829</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.01658e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>250</Hsp_hit-from>
      <Hsp_hit-to>339</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gi|6176337|gb|AF188530.1|AF188530</Hit_id>
  <Hit_def>Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds</Hit_def&g
t;
  <Hit_accession>AF188530</Hit_accession>
  <Hit_len>2398</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.12279e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>105</Hsp_hit-from>
      <Hsp_hit-to>194</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>3</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
(...)
We want to insert that XML file into a database. I wrote the following XSLT stylesheet , it transforms the blast-xml into a set of SQL statements for sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml 

create table BlastOutput(
(...)


BEGIN TRANSACTION;

insert into BlastOutput(
 program,
 version,
 reference,
 db,
 query_ID,
 query_def,
 query_len
 )
 values (
  'tblastn',
  'TBLASTN 2.2.27+',
  'Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang,Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.',
  'nr',
  '52385',
  'myseq',
  30
  );


insert into Parameters(
 blastOutput_id,
 expect,
 matrix,
 sc_match,
 sc_mismatch,
 gap_open,
 gap_extend,
 filter
 )
select MAX(id),
 10,
 'BLOSUM62',
 NULL,
 NULL,
 11,
 1,
 'L;'
 from BlastOutput;
 


insert into Iteration(
 blastOutput_id,
 iter_num,
 query_id,
 query_def,
 query_len
 )
select MAX(id),
 1,
 '52385',
 'myseq',
 30
from BlastOutput;

insert into Hit(iteration_id,num,hit_id,def,accession,len)
select MAX(id),
 1,
 'gi|110624327|dbj|AK225891.1|',
 'Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02',
 'AK225891',
 1829
from Iteration;
(...)
All in one you can redirect the output to sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml |\
 sqlite3 input.db
and query the database:
$ sqlite3 -header -line input.sqlite \
 'select * from Hsp,Hit where Hsp.hit_id=Hit.id limit 2'

          id = 1
      hit_id = 1
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.01658e-10
  query_from = 1
    query_to = 30
    hit_from = 250
      hit_to = 339
 query_frame = 0
   hit_frame = 1
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 1
iteration_id = 1
         num = 1
      hit_id = gi|110624327|dbj|AK225891.1|
         def = Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02
   accession = AK225891
         len = 1829

          id = 2
      hit_id = 2
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.12279e-10
  query_from = 1
    query_to = 30
    hit_from = 105
      hit_to = 194
 query_frame = 0
   hit_frame = 3
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 2
iteration_id = 1
         num = 2
      hit_id = gi|6176337|gb|AF188530.1|AF188530
         def = Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds
   accession = AF188530
         len = 2398
That's it,

Pierre

25 March 2013

Embedding Pubmed, Graphiviz and a remote image in #LaTeX. My notebook. .

I'm learning LaTeX. Today I learned how to create a new command in LaTeX.

\newcommand{name}[num]{definition}
"Basically the command requires two arguments: the name of the command you want to create, and the definition of the command" . I played with LaTeX and wrote the following three commands:

Embedding a remote picture

The following LaTeX document defines a new command "\remoteimage". It takes 3 parameters: a filename, a URL and some parameters for \includegraphics. If the file doesn't exist, the url is downloaded and saved in 'file'. The downloaded image is then included in the final LaTeX document.

Note: latex files must be compiled with --enable-write18 to enable system-calls.
pdflatex --enable-write18 input.tex
Result:

External Image /Latex by lindenb


GraphViz Dot

The second LaTex Document works the same way. It defines a command "\graphviz" , sends the content of the 2nd argument to graphviz dot and save the resulting image before importing it into the LaTeX document.

Result:

GraphViz / Latex by lindenb


Pubmed

The last command define "\pmid" . It needs one Pubmed identifer. It downloads the XML record for this pmid, transforms it to LaTeX with xsltproc and the following XSLT stylesheet:

The LaTeX document includes four pubmed identifiers:

Result:

Pumed / Latex by lindenb






That's it,

Pierre




20 December 2012

RDF/Jena: a simple extension for XSLT/XALAN. Testing with NCBI-Gene

In a previous post, I've shown that the XALAN XSLT engine can be extended with custom function returning a DOM Document that will be used by the xslt-stylesheet. Here, I'll create an extension for XALAN getting some RDF statements from a Jena/RDF model. The RDF model will be loaded in memory but one can imagine to use a persistent model ( TDB or SDB). I'll download a record from NCBI-gene, transform it to html and use the disease-ontology database as RDF to annotate it.

A Gene record is downloaded as XML from NCBI gene:

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=4853&retmode=xml" > notch2.html
The disease ontology is downloaded as RDF/XML:
curl -odoid.owl "http://www.berkeleybop.org/ontologies/doid.owl"

The XSLT Stylesheet

The stylesheet declares the extension jena, loads the RDF model ("$model"), searches for the OMIM identifiers in the Gene record and loads the RDF statements related to that OMIM-ID.
For example the following xpath expression:
jena:query(
   $model,
   $doiid,
   'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
   ''
   )
returns a rdf/XML document containing the RDF statements having a subject=$doiid, a property "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym" and any object.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Statement>
    <rdf:subject rdf:resource="http://purl.obolibrary.org/obo/DOID_0050721"/>
    <rdf:predicate rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
    <rdf:object>Phosphoserine phosphatase deficiency</rdf:object>
  </rdf:Statement>
</rdf:RDF>
The stylesheet:

The Java code

This is the java extension: the constructor loads the RDF model in memory. The function query(..) returns a RDF/XML document matching the query.

Makefile




config.mk:

Result

java -cp ${class.path} org.apache.xalan.xslt.Process \
 -IN notch2.xml \
 -XSL gene2html.xsl -EDUMP -OUT result.html


NOTCH2

Omim ID 610205

Label
Alagille syndrome
Synonym
Arteriohepatic dysplasia (disorder)
Sub-Class Of
Label
gastrointestinal system disease
Synonym
gastrointestinal disease
Sub-Class Of
Label
disease of anatomical entity
Sub-Class Of
Label
disease



Omim ID 102500

Label
Hajdu-Cheney syndrome
Synonym
Hajdu-Cheney syndrome (disorder)
Sub-Class Of
Label
autosomal dominant disease
Sub-Class Of
Label
autosomal genetic disease
Sub-Class Of
Label
monogenic disease
Sub-Class Of
Label
genetic disease
Sub-Class Of
Label
disease









That's it,


Pierre


14 October 2012

Calculating time from submission to publication / Degree of burden in submitting a paper

After "404 not found": a database of non-functional resources in the NAR database collection, I've uploaded my second dataset on figshare:
Calculating time from submission to publication / Degree of burden in submitting a paper
.

Calculating time from submission to publication / Degree of burden in submitting a paper. Pierre Lindenbaum,  Ryan Delahanty.
figshare.
Retrieved 10:13, Oct 14, 2012 (GMT)
http://dx.doi.org/10.6084/m9.figshare.96403

This dataset was inspired by this post on biostar, initialy asked by Ryan Delahanty: I was wondering if it would be possible to calculate some kind of a metric for the speed-of-publication for each journal. I'm not sure submitted and accepted dates are available for all papers, but I noticed in XML data there are fields like the following:
<PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2011</Year>
                <Month>11</Month>
                <Day>29</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2011</Year>
                <Month>12</Month>
                <Day>20</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
           (...)

In this dataset, the script 'pubmed.sh" downloads the the journals from http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.pubmedhelptable45/ , the 'eigenfactors' from http://www.eigenfactor.org.

For each journal , It scans pubmed (starting from year=2000) and get the difference between the date[@PubStatus='received'] and the date[@PubStatus='accepted'].

titleissneigenfactordays
"Acta biochimica Polonica"0001-527X0.003996119.770935960591
"Acta biomaterialia"1742-70610.02152129.682692307692
"Acta biotheoretica"0001-53420.000844161.897058823529
"Acta cirurgica brasileira / Sociedade Brasileira para Desenvolvimento Pesquisa em Cirurgia"0102-86500.00128122.038461538462
"Acta cytologica"0001-55470.00230565.3006134969325
"Acta diabetologica"0940-54290.001851299.6
"Acta haematologica"0001-57920.002825118.654676258993
"Acta histochemica"0065-12810.002162110.471204188482
"Acta histochemica et cytochemica"0044-59910.00067781.6455696202532
"Acta neurochirurgica"0001-62680.009685204.371830985916
"Acta neuropathologica"0001-63220.02347169.7277882797732
"Acta theriologica"0001-70510.000901147.0
"Acta tropica"0001-706X0.01011196.577777777778
"Acta veterinaria Scandinavica"0044-605X0.00161282.0
"Addictive behaviors"0306-46030.017915163.049731182796
"Advances in space research "0273-11770.021217205.0
Ambio0044-74470.007463181.878048780488
"American journal of human genetics"0002-92970.12015667.1898928024502
"American journal of hypertension"0895-70610.017359104.074576271186
(....)

Here is the kind of figure I got:

As far as I remember, "Cell" is the point having the highest eigenfactor.


Note: pubmed contains some errors: e.g. received > accepted (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20591334&retmode=xml) or some dates in the future: ( http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12921703&retmode=xml )


That's it,

Pierre

05 September 2012

Customizing the java classes for the NCBI generated by XJC

Reminder: XJC is the Java XML Binding Compiler. It automates the mapping between XML documents and Java objects:

XSD (aka XML-schema)
+
XJC
=
JAVA Classes

The code generated by XJC allows to :
  • Unmarshal XML content into a Java representation
  • Access and update the Java representation
  • Marshal the Java representation of the XML content into XML content

For example, the following XML-Schema (tinyseq.xsd) describes a TinySeq-XML document returned by the NCBI.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:documentation> XML schema for NCBI tinyseq format</xs:documentation>
  </xs:annotation>
  <xs:complexType name="TSeqSet_t">
    <xs:annotation>
      <xs:documentation>Set of sequences</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element ref="TSeq" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="TSeq_t">
    <xs:annotation>
      <xs:documentation>A Tiny Sequence</xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="TSeq_seqtype">
        <xs:complexType>
          <xs:attribute name="value">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:enumeration value="nucleotide"/>
                <xs:enumeration value="protein"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:complexType>
      </xs:element>
      <xs:element name="TSeq_gi" type="xs:long"/>
      <xs:element name="TSeq_accver" type="xs:string"/>
      <xs:element name="TSeq_sid" type="xs:string"/>
      <xs:element name="TSeq_taxid" type="xs:long"/>
      <xs:element name="TSeq_orgname" type="xs:string"/>
      <xs:element name="TSeq_defline" type="xs:string"/>
      <xs:element name="TSeq_length" type="xs:nonNegativeInteger"/>
      <xs:element name="TSeq_sequence" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="TSeqSet" type="TSeqSet_t"/>
  <xs:element name="TSeq" type="TSeq_t"/>
</xs:schema>

This xml-schema can be compiled with XJC:

${JAVA_HOME}/bin/xjc -d . -p generated tinyseq.xsd
parsing a schema...
compiling a schema...
generated/ObjectFactory.java
generated/TSeqSetT.java
generated/TSeqT.java

$ more generated/TSeqT.java

package generated;
(...)
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "TSeq_t", propOrder = {  "tSeqSeqtype",  "tSeqGi",  "tSeqAccver","tSeqSid",(...),"tSeqSequence"})
public class TSeqT {
    @XmlElement(name = "TSeq_seqtype", required = true)
    protected TSeqT.TSeqSeqtype tSeqSeqtype;
    @XmlElement(name = "TSeq_gi")
    protected long tSeqGi;
    @XmlElement(name = "TSeq_accver", required = true)
    protected String tSeqAccver;
    @XmlElement(name = "TSeq_sid", required = true)
    protected String tSeqSid;
    @XmlElement(name = "TSeq_taxid")
    protected long tSeqTaxid;
    @XmlElement(name = "TSeq_orgname", required = true)
    protected String tSeqOrgname;
    @XmlElement(name = "TSeq_defline", required = true)
    protected String tSeqDefline;
    @XmlElement(name = "TSeq_length", required = true)
    @XmlSchemaType(name = "nonNegativeInteger")
    protected BigInteger tSeqLength;
    @XmlElement(name = "TSeq_sequence", required = true)
    protected String tSeqSequence;
(...)
    }

But XJC doesn't know how to generate some classical java functions like 'hashCode', 'equals' or 'toString' or to add some custom methods to your classes.

Hopefully the standard distribution of XJC comes with a plugin named -Xinject-code whch injects some custom code in the classes generated by XJC.

XSD (aka XML-schema)
+
java xml binding (jxb)
+
XJC
=
Customized JAVA Classes

For example, if we want to add a toString method to the class TSeqT, we're going to write the following "java xml binding file" (jxb) which alters the initial xml schema:

<?xml version="1.0" encoding="UTF-8"?>
<jxb:bindings 
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xjc="http://java.sun.com/xml/ns/jaxb/xjc"
xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
xmlns:ci="http://jaxb.dev.java.net/plugin/code-injector"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
jxb:extensionBindingPrefixes="ci "
jxb:version="2.1"

>

<jxb:bindings schemaLocation="tinyseq.xsd">
        <!-- here we use an XPATH expression to tell xjc about which part
 of the XML schema we want to change -->
 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeq_t']">
  <ci:code>

 /** toString : returns the gi and the defline  */
 public String toString()
  {
  return "gi:"+getTSeqGi()+"|"+getTSeqDefline();
  }

</ci:code>
 </jxb:bindings>
</jxb:bindings>

</jxb:bindings>

Below, I wrote a larger JXB file 'tinyseq.jxb' which injects the following methods:
  • 'equals' method for TSeq
  • 'hashCode' method for TSeq
  • 'toString' method for TSeq
  • 'printAsFasta' method for TSeq
  • 'getTSeqSetbyId' method for TSeqSet. A static function fetching a TinySeq sequence from the NCBI for a given 'gi'
  • a 'main' method for TSeqSet. It loops over a list of 'gi's, fetches the sequences (using NCBI-EFetch) and prints the sequences as FASTA
<?xml version="1.0" encoding="UTF-8"?>
<jxb:bindings 
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xjc="http://java.sun.com/xml/ns/jaxb/xjc"
xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
xmlns:ci="http://jaxb.dev.java.net/plugin/code-injector"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
jxb:extensionBindingPrefixes="ci "
jxb:version="2.1"

>

<jxb:bindings schemaLocation="tinyseq.xsd">
 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeq_t']">
  <ci:code>
 /* print this sequence as fasta */
 public void printAsFasta(java.io.PrintStream out)
  {
  String s=getTSeqSequence();
  out.print(">gi:"+getTSeqGi()+"|"+ getTSeqAccver() +"|"+getTSeqDefline());
  for(int i=0;i &lt; s.length();++i)
   {
   if(i%60==0) out.println();
   out.print(s.charAt(i));
   }
  out.println();
  }

 /** equals: two  TSeq are equal if they have the same gi */
 @Override
 public boolean equals(Object o)
  {
  if(o==this) return true;
  if(o==null || o.getClass()!=this.getClass()) return false;
  return this.getTSeqGi()==TSeqT.class.cast(o).getTSeqGi();
  }

 /** hashCode : use gi */
 @Override
 public int hashCode()
  {
  return  (int)(this.getTSeqGi()^(this.getTSeqGi()>>>32));
  }

 /** toString : returns the gi and the defline  */
 public String toString()
  {
  return "gi:"+getTSeqGi()+"|"+getTSeqDefline();
  }

</ci:code>
 </jxb:bindings>




 <jxb:bindings node="/xs:schema/xs:complexType[@name='TSeqSet_t']">
  <ci:code>
 /** get TSeqSetT from a given gi */
 public static TSeqSetT getTSeqSetbyId(long gi)
  throws javax.xml.bind.JAXBException, javax.xml.bind.UnmarshalException , java.io.IOException
  {
  /** find the JAXB context in the defined path */
  javax.xml.bind.JAXBContext jc = javax.xml.bind.JAXBContext.newInstance(TSeqSetT.class,TSeqT.class);
  javax.xml.bind.Unmarshaller u = jc.createUnmarshaller();
  String uri="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&amp;rettype=fasta&amp;retmode=xml&amp;id="+gi;
  /** read the sequence */
  return u.unmarshal(new javax.xml.transform.stream.StreamSource(uri),TSeqSetT.class).getValue();
  }

 /** main: takes a list of gi and prints the sequences as fasta */
 public static void main(String args[]) throws Exception
  {
  for(int optind=0;optind &lt; args.length;++optind)
   {
   TSeqSetT tss=TSeqSetT.getTSeqSetbyId(Long.parseLong(args[optind]));
   for(generated.TSeqT seq:tss.getTSeq()) seq.printAsFasta(System.out);
   }
  }

</ci:code>

</jxb:bindings>

</jxb:bindings>

</jxb:bindings>

Compile the schema, compile the java classes and execute

$ xjc -target 2.1 -verbose -Xinject-code -extension -d . -p generated -b tinyseq.jxb tinyseq.xsd
parsing a schema...
compiling a schema...
[INFO] generating code
unknown location

generated/ObjectFactory.java
generated/TSeqSetT.java
generated/TSeqT.java

$ javac generated/*.java
$ java generated.TSeqSetT 25 26 27
>gi:25|X53813.1|Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTG
GGGGTCCAGCCATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCT
AAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACACTAGAGTCTTAGAA
ATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTAGGGTTAG
GGTTAGGGTACGGCTTAGGGTACTGTTTCGGGGAGGGGTTCAGGTACGGCGTAGGGTATG
GGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAG
GATTAGGGTACGTGTTAGGGTTAGGGTAGGGCTTAGGGTTAGGGTACGTGTTAGGGTTAG
GG
>gi:26|X53814.1|Blue Whale heavy satellite DNA
TAGTTATTAAACCTATCCCACTCTCTAGATACACCTTAGCACGTAAAGGAATATTATTTG
GGGGTCCAGACATGGAGAAGAGTTTAGACACTAGGATAAGATAAGGAACACACCCATTCT
AAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACTCTAGTGCTTAGGAA
ATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTATGGTTAG
GGCTAGGGTACGGCTTAGGGTACGGATTCGGGGAGGGGTTCGGGTACGGCGTAGGGTATG
GGTTAGGGTTAGCGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAGGATTAG
GGTACGTGTTAGGGTTAGGGTAGGGGTTAGGGTTAGGGTACGCGTTAGGGTTAGGG
>gi:27|Z18633.1|B.physalus gene for large subunit rRNA
AACCAGTATTAGAGCACTGCCTGCCCGGTGACTAATCGTTAAACGGCCGCGGTATCCTGA
CCGTGCAAAGGTAGCATAATCACTTGTTCTCTAATTAGGGACTTGTATGAATGGCCACAC
GAGGGTTTTACTGTCTCTTACTTTTAATCAGTGAAATTGACCTCTCCGTGAAGAGGCGGA
GATAACAAAATAAGACGAGAAGACCCTATGGAGCTTCAATTAATCAACCCAAAAACCATA
ACCTTAAACCACCAAGGGATAACAAAACCTTATATGGGCTGACAATTTCGGTTGGGGTGA
CCTCGGAGTACAAAAAACCCTCCGAGTGATTAAAACTTAGGCCCACTAGCCAAAGTACAA
TATCACTTATTGATCCAATCCTTTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGC
AATCCTATTCTAGAGTCCATATCGACAATAGGGTTTACGACCTCGATGTTGGATCAGGAC
ATCCTAATGGTGCAGCTGCTATTAAGGGTTCGTTTGTT
That's it,


Pierre