Showing posts with label blast. Show all posts
Showing posts with label blast. Show all posts

09 September 2016

Playing with #magicblast, the #NCBI Short read mapper. My notebook

NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.



Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).

First, here is an extract of the manual for magicblast.
USAGE
DESCRIPTION
   Short read mapper

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
    * Incompatible with:  sra
 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
   Input format for sequences
   Default = `fasta'
    * Incompatible with:  sra
 -paired
   Input query sequences are paired
 -query_mate <File_In>
   FASTA file with mates for query sequences (if given in another file)
    * Requires:  query
 -sra <String>
   Comma-separated SRA accessions
    * Incompatible with:  query, infmt

 *** Formatting options
 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
   alignment view options:
   sam = SAM format,
   tabular = Tabular format,
   text ASN.1
   Default = `sam'

Indexing the reference for magicblast

I've indexed the human chr22 using the good old NCBI makeblastdb:
$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22

Mapping Paired-End reads with magicblast

First I've removed the read names' suffixes '/1' and '/2' because they still appear in the final bam.
gunzip -c R1.aa.fq.gz | sed 's/\/1$$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's/\/2$$//' | gzip > R2.fq.gz
The reads are then mapped with magicblast using the following command:
magicblast -db chr22 \
 -infmt fastq -paired \
 -query R1.fq.gz \
 -query_mate R2.fq.gz |\
 sed 's/gnl\|BL_ORD_ID\|0/chr22/g' |\
 samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM
As far as I could see, there was no option to specify the read group (RG).

Here,I've transformed the name of the contig because it looks like a blast-id identifier:
I've also mapped the reads using bwa:

bwa mem  chr22.fa R1.fq.gz R2.fq.gz |\
 samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM

BAM headers

Here is the BAM header for magicblast: Magic blast uses the spec v1.2 and has a Group-Order flag.

@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz

And the BAM header for bwa:
@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz

Samtools samflags

for magicblast:
24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)

for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Validationg with picard

Validation of child.magic.bam with picard generates many errors:
$ java -jar picard.jar  ValidateSamFile  I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)
while there is ~no error for child.bwa.bam:
$ java -jar picard.jar  ValidateSamFile  I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200



Variant Calling

Both BAMs are mapped with samtools/vcftools (min DP=10)
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.bwa.vcf -
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |\
 bcftools filter -e 'DP<10'  --output-type v -o child.magic.vcf -
Number of variants:
$ grep -v "#" child.magic.vcf  | wc -l
111
$ grep -v "#" child.bwa.vcf  | wc -l
166

Here is the diff for CHROM/POS/REF/ALT:
  • Uniq to BWA: 'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 64 variants
  • Uniq to Magic: 'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 9 variants
  • Common variants: 'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 102 variants

The Makefile

SHELL=/bin/bash
data.dir=${HOME}/src/DATA
ncbi.bin=${HOME}/packages/magicblast/bin
REF=chr22.fa
samtools.exe=${HOME}/packages/samtools/samtools 
bwa.exe=${HOME}/packages/bwa-0.7.15/bwa 

all: child.magic.bam child.bwa.bam

R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz 
 gunzip -c $< | sed 's/\/1$$//' | gzip > $@
R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz 
 gunzip -c $< | sed 's/\/2$$//' | gzip > $@

child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz 
 ${bwa.exe} mem  ${REF} $(word 2,$^) $(word 3,$^) |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 


child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz 
 ${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\
 sed 's/gnl\|BL_ORD_ID\|0/$(basename ${REF})/g' |\
 ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM 

${REF}.bwt : ${REF}
 ${bwa.exe} index $<

${REF}.nin : ${REF}
 ${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF})

${REF} :
 wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz"
 gunzip -f $@.gz

That's it,
Pierre

29 June 2015

A BLAST to SAM converter.

Some times ago, I've received a set of Ion-Torrent /mate-reads with a poor quality. I wasn't able to align much things using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no) . That's why I asked guyduche, my intership student to write a C program to convert the output of blastn to SAM. His code is available on github at :

The input for blast2sam is
  • the XML output of NCBI blastn (or stdin)
  • The single or pair of fastq file(s)
  • The reference sequence indexed with picard
.

Example:

fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

Output:
@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2 83  gi|9629357|ref|NC_001802.1| 715 60  180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S =   715 -119    CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN    (),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!    NM:i:13 RG:Z:foo    AS:i:80 XB:f:148.852    XE:Z:4.07e-39
ERR656485.2 163 gi|9629357|ref|NC_001802.1| 715 60  73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S    =   715 119 NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA    !8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))    NM:i:13 RG:Z:foo    AS:i:82 XB:f:152.546    XE:Z:3.15e-40
(...)
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,
Pierre

02 February 2015

Listing the 'Subject' Sequences in a BLAST database using the NCBI C++ toolbox. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/filtering-fasta-sequences-using-ncbi-c.html) I've built an application filtering FASTA sequences using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm gonna write a tool listing the 'subject' sequences in a BLAST database.

This new application ListBlastDatabaseContent takes only one argument '-db', the name of the blast database :

void ListBlastDatabaseContent::Init()
    {
    (...)
    /* argument for name */
    arg_desc->AddDefaultKey(
            "db",/* name */
            "database",/* synopsis */
            "Blast database",/* comment */
            CArgDescriptions::eString, /* argument type */
            "" /* default value*/
            );
    (...)
    }   

in ListBlastDatabaseContent::Run() we get the name of the database

(...)
string database_name(  this->GetArgs()["db"].AsString() );
(...)

and we open a new CSearchDatabase object ("Blast Search Subject")

CSearchDatabase* searchDatabase = new CSearchDatabase(
    database_name, /* db name */
    CSearchDatabase::eBlastDbIsNucleotide /* db type */
    );

We get a handle to the seqdb, "it defines access to the database component by calling methods on objects which represent the various database files, such as the index, header, sequence, and alias files".

CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();

We get an iterator to the seqdb database http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html:

CSeqDBIter  iter = seqdb->Begin();

While this iterator is valid, we retrieve and print the DNA sequence and the header associated to the iterator.

while(iter)
    {
    string output;
    TSeqRange range; /* entire sequence */
    seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
    CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
    CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
    while(r!= def_line->Get().end())
        {
        cout << ">" << (*r)->GetTitle()  << endl;
        cout << output << endl;
        ++r;
        }
    ++iter;
    }

All in one:

#include <memory>
#include <limits>
#include "corelib/ncbiapp.hpp"
#include <algo/blast/api/blast_types.hpp>
#include <algo/blast/api/sseqloc.hpp>
#include <algo/blast/api/blast_aux.hpp>
#include <algo/blast/api/blast_options_handle.hpp>
#include <algo/blast/api/blast_results.hpp>
#include <algo/blast/api/local_blast.hpp>
#include <ctype.h>



USING_NCBI_SCOPE;
using namespace blast;

/**
 *  Name
 *      ListBlastDatabaseContent
 *  Motivation:
 *      Print content of blast database
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class ListBlastDatabaseContent : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        ListBlastDatabaseContent()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                __FILE__
                );
            
         
            /* argument for name */
           arg_desc->AddDefaultKey(
                    "db",/* name */
                    "database",/* synopsis */
                    "Blast database",/* comment */
                    CArgDescriptions::eString, /* argument type */
                    "" /* default value*/
                    );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~ListBlastDatabaseContent()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            string database_name(  this->GetArgs()["db"].AsString() );
            /* Blast Search Subject.  http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSearchDatabase.html#details */
            CSearchDatabase* searchDatabase = 0;
            
        
            /* initialize search database */
            searchDatabase = new CSearchDatabase(
                database_name, /* db name */
                CSearchDatabase::eBlastDbIsNucleotide /* db type */
                );
            
            /* This class provides the top-level interface class for BLAST database users. It 
            defines access to the database component by calling methods on objects which represent
            the various database files, such as the index, header, sequence, and alias files. 
            http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
            CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();                 
            
            /* Small class to iterate over a seqdb database. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html#details */
            CSeqDBIter  iter = seqdb->Begin();
            
            while(iter)
                {
                /* we put the sequence here */
                 string output;
                 /* retrieve the entire sequence */
                 TSeqRange range;
                /* This method gets the sequence data for an OID, converts it to a human-readable encoding (either Iupacaa for protein, or Iupacna for nucleotide), and returns it in a string. This is equivalent to calling the three-argument versions of this method with those encodings. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
                seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
                /* Represents ASN.1 type Blast-def-line-set defined in file blastdb.asn http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCBlast__def__line__set.html */
                CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
                CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
                while(r!= def_line->Get().end())
                    {
                    cout << ">" << (*r)->GetTitle()  << endl;
                    cout << output << endl;
                    ++r;
                    }

                ++iter;
                }
            if( searchDatabase != 0) delete searchDatabase;
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return ListBlastDatabaseContent().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "loadblastdb" /* Specify application name */
        );
    }

Testing

get some fasta sequences, remove the headers and create a blast database.

$ curl  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=281381068,281381067,281381066,281381065&rettype=fasta" |\
        awk '/^>/ {printf(">%d\n",NR);next;} {print;}' > sequences.fa

$ makeblastdb -in sequences.fa -dbtype nucl

run the program:

$ loadblastdb -db sequences.fa
>1
GATCGCGGGTATCTTCGATGTAGGGCGAGTCGCGTCACCCTGACCAACCCCTCGAGGTGGTCCTCGTGGAAGGAGGAAGCGCAGTCGACGAAGCGGAAGATGTGGCTGAGCCATCTCGTCGTGTCGGCCGCGAGACGACCCGCGAGTTTCTGCGCGTGTGACTTTGGTAGCGCTACTGCGTAAGCCGAGA
>6
CAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTC
>10
ACCGAAACGGCAAAATTGGCCTTAGTCACGCACACTATCTCGGCTTACGCAGTAGCGCTACCAAAGTCACACGCGCAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTCGCCCTACATCGAAGATACCCGCGATC
>16
GATCGAACGATAAACTTCTGCTATCGGTTCCTCAGCAATCAGTATAACGAGCGGCACATGTATTGCGGCATATTGCAAAAAACATACCTATAAAGAAGGCTAAACTGATTGATTCAAATTAACTTGAGAATACACGATGTTGATGTTCAGTCGAAACAAATCGCTCGCGTATTGTGATC

That's it,
Pierre

26 September 2013

Presentation: Advanced NCBI

Here is my presentation for the course "Advanced NCBI", I gave yesterday at the University of Nantes.


09 July 2013

Mapping the annotations of a query sequence on a BLAST hit, my notebook.

This post is the answer to my own question on biostar "BLASTN / TBLASTN : mapping the features of the query to the hit.". I wrote a java program to map the annotations of a sequence to the Hit of a Blast result. The tool is available on github at https://github.com/lindenb/jvarkit.
For example, say you want to map the features of the Uniprot record for Rotavirus NSP3 (http://www.uniprot.org/uniprot/P04514) on Genbank http://www.ncbi.nlm.nih.gov/nuccore/AY065842.1 (RNA).

Download P04514 as XML

TblastN P04514(protein) vs AY065842 (RNA) and save the BLAST result as XML:

<BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.28+</BlastOutput_version>
(...)
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|18139606|gb|AY065842.1|</Hit_id>
  <Hit_def>AY065842</Hit_def>
  <Hit_accession>AY065842</Hit_accession>
  <Hit_len>1078</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>546.584</Hsp_bit-score>
      <Hsp_score>1407</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>313</Hsp_query-to>
      <Hsp_hit-from>26</Hsp_hit-from>
      <Hsp_hit-to>964</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>2</Hsp_hit-frame>
      <Hsp_identity>260</Hsp_identity>
      <Hsp_positive>260</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>313</Hsp_align-len>
      <Hsp_qseq>MLKMESTQQMASSIINTSFEAAVVAATSTLELMGIQYDYNEIYTRVKSKFDYVMDDSGVKNNLLGKAATIDQALNGKFGSVMRNKNWMTDSRTVAKLDEDVNKLRMMLSSKGIDQKMRVLNACFSVKRIPGKSSSVIKCTRLMKDKIERGAVEVDDSFVEEKMEVDTVDWKSRYDQLERRFESLKQRVNEKYTTWVQKAKKVNENMYSLQNVISQQQNQIADLQNYCSKLEADLQNKVGSLVSSVEWYLKSMELPDEVKTDIEQQLNSIDTISPINAIDDLEILIRNLIHDYDRTFLMFKGLLRQCNYEYAYE</Hsp_qseq>
      <Hsp_hseq>MLKMESTQQMASSIINSSFEAAVVAATSTLELMGIQYDYNEVYTRVKSKFDLVMDDSGVKNNLIGKAITIDQALNGKFSSAIRNRNWMTDSRTVAKLDEDVNKLRIMLSSKGIDQKMRVLNACFSVKRIPGKSSSIVKCTRLMKDKLERGEVEVDDSFVEEKMEVDTIDWKSRYEQLEKRFESLKHRVNEKYNHWVLKARKVNENMNSLQNVISQQQAHINELQMYNNKLERDLQSKIGSVVSSIEWYLRSMELSDDVKSDIEQQLNSIDQLNPVNAIDDFESILRNLISDYDRLFIMFKGLLQQCNYTYTYE</Hsp_hseq>
      <Hsp_midline>MLKMESTQQMASSIIN SFEAAVVAATSTLELMGIQYDYNE YTRVKSKFD VMDDSGVKNNL GKA TIDQALNGKF S  RN NWMTDSRTVAKLDEDVNKLR MLSSKGIDQKMRVLNACFSVKRIPGKSSS  KCTRLMKDK ERG VEVDDSFVEEKMEVDT DWKSRY QLE RFESLK RVNEKY  WV KA KVNENM SLQNVISQQQ  I  LQ Y  KLE DLQ K GS VSS EWYL SMEL D VK DIEQQLNSID   P NAIDD E   RNLI DYDR F MFKGLL QCNY Y YE</Hsp_midline>
    </Hsp>
   (..)
   
Now generate a BED file to map the features of P04514 on AY065842:
$ java -jar dist/blastmapannots.jar I=P04514.xml B=blast.xml

AY065842 25 961 Non-structural_protein_3 943 + 25961 255,255,255 1 936 25
AY065842 34 469 RNA-binding 970 + 34 469 255,255,255 1 435 34
AY065842 472 640 Dimerization 947 + 472 640 255,255,255 1 168 472
AY065842 532 724 Interaction_with_ZC3H7B 917 + 532 724 255,255,255 1 192 532
AY065842 646 961 Interaction_with_EIF4G1 905 + 646 961 255,255,255 1 315 646
AY065842 520 733 coiled-coil_region 916 + 520 733 255,255,255 1 213 520

That's it,

Pierre

01 May 2013

Inserting the result of a BLAST into a Database using XSLT.

Here is the XML output of a BLAST:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.27+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Z
hang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>52385</BlastOutput_query-ID>
  <BlastOutput_query-def>myseq</BlastOutput_query-def>
  <BlastOutput_query-len>30</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>52385</Iteration_query-ID>
  <Iteration_query-def>myseq</Iteration_query-def>
  <Iteration_query-len>30</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|110624327|dbj|AK225891.1|</Hit_id>
  <Hit_def>Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02</Hit_def>
  <Hit_accession>AK225891</Hit_accession>
  <Hit_len>1829</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.01658e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>250</Hsp_hit-from>
      <Hsp_hit-to>339</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gi|6176337|gb|AF188530.1|AF188530</Hit_id>
  <Hit_def>Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds</Hit_def&g
t;
  <Hit_accession>AF188530</Hit_accession>
  <Hit_len>2398</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.12279e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>105</Hsp_hit-from>
      <Hsp_hit-to>194</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>3</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
(...)
We want to insert that XML file into a database. I wrote the following XSLT stylesheet , it transforms the blast-xml into a set of SQL statements for sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml 

create table BlastOutput(
(...)


BEGIN TRANSACTION;

insert into BlastOutput(
 program,
 version,
 reference,
 db,
 query_ID,
 query_def,
 query_len
 )
 values (
  'tblastn',
  'TBLASTN 2.2.27+',
  'Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang,Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.',
  'nr',
  '52385',
  'myseq',
  30
  );


insert into Parameters(
 blastOutput_id,
 expect,
 matrix,
 sc_match,
 sc_mismatch,
 gap_open,
 gap_extend,
 filter
 )
select MAX(id),
 10,
 'BLOSUM62',
 NULL,
 NULL,
 11,
 1,
 'L;'
 from BlastOutput;
 


insert into Iteration(
 blastOutput_id,
 iter_num,
 query_id,
 query_def,
 query_len
 )
select MAX(id),
 1,
 '52385',
 'myseq',
 30
from BlastOutput;

insert into Hit(iteration_id,num,hit_id,def,accession,len)
select MAX(id),
 1,
 'gi|110624327|dbj|AK225891.1|',
 'Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02',
 'AK225891',
 1829
from Iteration;
(...)
All in one you can redirect the output to sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml |\
 sqlite3 input.db
and query the database:
$ sqlite3 -header -line input.sqlite \
 'select * from Hsp,Hit where Hsp.hit_id=Hit.id limit 2'

          id = 1
      hit_id = 1
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.01658e-10
  query_from = 1
    query_to = 30
    hit_from = 250
      hit_to = 339
 query_frame = 0
   hit_frame = 1
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 1
iteration_id = 1
         num = 1
      hit_id = gi|110624327|dbj|AK225891.1|
         def = Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02
   accession = AK225891
         len = 1829

          id = 2
      hit_id = 2
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.12279e-10
  query_from = 1
    query_to = 30
    hit_from = 105
      hit_to = 194
 query_frame = 0
   hit_frame = 3
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 2
iteration_id = 1
         num = 2
      hit_id = gi|6176337|gb|AF188530.1|AF188530
         def = Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds
   accession = AF188530
         len = 2398
That's it,

Pierre

22 March 2011

Blast Stylesheet : XML to HTML

I wrote a XSLT stylesheet for the following question on Biostar: I'd like to create an HTML file (from the XML file and XSL stylesheet) similar to what It can be achieved when we performed a BLAST search on the NCBI server.

The stylesheet I wrote is available on github at: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2html.xsl. (see also my previous post blast2svg )

Usage:

xsltproc --novalid blast2html.xsl blast.xml > blast.html

Example:

Here is a XML output of blast:
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.25+</BlastOutput_version>
<BlastOutput_reference>Alejandro A. Sch&auml;ffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.</BlastOutput_reference>
<BlastOutput_db>N/A</BlastOutput_db>
<BlastOutput_query-ID>gi|187956781|gb|AAI40897.1|</BlastOutput_query-ID>
<BlastOutput_query-def>EIF4G1 protein [Homo sapiens]</BlastOutput_query-def>
<BlastOutput_query-len>1606</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|187956781|gb|AAI40897.1|</Iteration_query-ID>
<Iteration_query-def>EIF4G1 protein [Homo sapiens]</Iteration_query-def>
<Iteration_query-len>1606</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|293340930|ref|XP_002724789.1|</Hit_id>
<Hit_def>PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus]</Hit_def>
<Hit_accession>XP_002727969</Hit_accession>
<Hit_len>1584</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>2715.64</Hsp_bit-score>
<Hsp_score>7038</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>1606</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>1584</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>1450</Hsp_identity>
<Hsp_positive>1450</Hsp_positive>
<Hsp_gaps>36</Hsp_gaps>
<Hsp_align-len>1613</Hsp_align-len>
<Hsp_qseq>MNKAPQSTGPPPAPSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQGGFRSLQHFYPSRAQPPSSAASRVQSAAPARPGPAAHVYPAGSQVMMIPSQISYPASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQGVQQFPTGVAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPDDRSQGAIIADRPGLPGPEHSP-SESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTT--IQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTPLASHTVEIHEPNGMVPSEDLEPEVESSPELAPPP--ACPSESPVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEV-TASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGEAGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKTPLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQGPRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQQAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCAGHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLRPLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLGEESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVCYSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAE-EESDHN</Hsp_qseq>
<Hsp_hseq>MNKAPQPTGPPPARSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ-------HFYPSRAQPPSSAASRVQSAAPARPGPAPHVYPAGSQVMMIPSQISYSASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQSVQQFPASVAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGSLEPQPNGESPQVAVIIRPDDRSQGAAIGGRPGLPGPEHSPGTESQPSSPSPTPSPPPILEPGSESNLGVLSIPGDTMTTGMIPISVEESTPISCESGEPYCLSPEPT-LAEPILEVEVTLSKPIPESEFSSSPLQVSTSLVPHRAETHEPNGVIPSEDLEPEVESSTEPAPPPLSACASESLVPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEVPSAALA--SIVSPTPPVAPSDTSAAQEEEIEED-------EDEDGEAESEKGGEDL-PLDSTPVPAQLSQNLEVAAAPQVAVSVPKRRRKIKELNKKEAVGDLLDAFKEVDPAVPEVENQPPTGSNPSPESEGSAALPQPEEAEETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHITDVVLDKANKTPLRSLDPSRLPGINCGPDFTPSFANLGRPTLSSRGPPRGGPGGELPRGPAGLGPRRSQQGPRKETRKIISSVIMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDFAKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRQSNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKGGDKRRGGPPGPP-------VNDGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDTASEATRPA--TLNRFSALQQTLPVENTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTE--DRGRDPVKREATLPPVSPPKAALAVDEVERKSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRLGIESTLERSTIAREHMGRLLHQLLCAGHLSTAQYYQGLYETLELAEDMEIDIPHVWLYLAELITPILQEDGVPMGELFREITKPLRPMGKATSLLLEILGLLCKSMGPKKVGMLWREAGLSWREFLAEGQDVGSFVAEKKVEYTLGEESEAPGQRALAFEELRRQLEKLLKDGGSNQRVFDWIEANLNEQQIASNTLVRALMTTVCYSAIIFETPLRVDVQVLKVRARLLQKYLSDEQKELQALYALQALVVTLEQPANLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFNWLREAEDEESDHN</Hsp_hseq>
<Hsp_midline>MNKAPQ TGPPPA SPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ HFYPSRAQPPSSAASRVQSAAPARPGPA HVYPAGSQVMMIPSQISY ASQGAYYIPGQGRSTYVVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQ VQQFP VAPAPVLMNQPPQIAPKRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGG LEPQ NGE PQVAVI RPDDRSQGA I RPGLPGPEHSP ESQPSSPSPTPSP P LEPGSE NL VLSIPGDTMTT I SVEESTPIS E GEPY LSPEPT LAEPILEVEVTLSKP PESEFSSSPLQ T L H E HEPNG PSEDLEPEVESS E APPP AC SES VPIAPTAQPEELLNGAPSPPAVDLSPVSEPEEQAKEV A A I S TP APS TS AQEEE EE E GEAESEKGGE L P STP PA LSQNLE AAA QVAVSVPKRRRKIKELNKKEAVGDLLDAFKE PAVPEVENQPP GSNP PESEGS P PEEA ETWDSKEDKIHNAENIQPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHI DVVLDKANKTPLR LDP RL GINCGPDFTPSFANLGR TLS RGPPRGGPGGELPRGPAGLGPRRSQQGPRKE RKII V MTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSILNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKVPTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEARDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLDF KAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLR SNWVPRRGDQGPKTIDQIHKEAEMEEHREHIKVQQLMAKG DKRRGGPPGPP V DGGWNTVPISKGSRPIDTSRLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSD ASEA RPA TLNRFSALQQ P E TDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFSKEVEERSRERPSQPEGLRKAASLTE DRGRD VKREA LPPVSP KAAL E E KSKAIIEEYLHLNDMKEAVQCVQELASPSLLFIFVR G ESTLERS IAREHMG LLHQLLCAGHLSTAQYYQGLYE LELAEDMEIDIPHVWLYLAEL TPILQE GVPMGELFREITKPLRP GKA SLLLEILGLLCKSMGPKKVG LWREAGLSW EFL EGQD G FVAE KVEYTLGEESEAPGQRAL EEL RQLEKLLK G SNQRVFDWIEANL EQQI SNTLVRALMT VCYSAIIFETPLRVDV VLK RA LLQKYL DEQKELQALYALQALVVTLEQP NLLRMFFDALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFF WLREAE EESDHN</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

After processing:

(...)

Descriptions

AccessionDefe-value
XP_002727969PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus]0
(...)

Alignments

>gi|293340930|ref|XP_002724789.1||XP_002727969|PREDICTED: eukaryotic translation initiation factor 4 gamma, 1 isoform 2 [Rattus norvegicus] >gi|293352298|ref|XP_002727969.1| PREDICTED: eukaryotic translation initiation factor 4, gamma 1 isoform 1 [Rattus norvegicus]
Length=1584
Score = 2715.64 bits (7038), Expect = 0
Identities = 1450/1613 (89.8946063236206%), Gaps = 36/1613 (2.231866088034718%)
Strand = Plus/Plus

Query 1 MNKAPQSTGPPPAPSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQGGFRSLQHFYP 60
MNKAPQ TGPPPA SPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ HFYP
Sbjct 1 MNKAPQPTGPPPARSPGLPQPAFPPGQTAPVVFSTPQATQMNTPSQPRQ-------HFYP 53

Query 61 SRAQPPSSAASRVQSAAPARPGPAAHVYPAGSQVMMIPSQISYPASQGAYYIPGQGRSTY 120
SRAQPPSSAASRVQSAAPARPGPA HVYPAGSQVMMIPSQISY ASQGAYYIPGQGRSTY
Sbjct 54 SRAQPPSSAASRVQSAAPARPGPAPHVYPAGSQVMMIPSQISYSASQGAYYIPGQGRSTY 113

Query 121 VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQGVQQFPTGVAPAPVLMNQPPQIAP 180
VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQ VQQFP VAPAPVLMNQPPQIAP
Sbjct 114 VVPTQQYPVQPGAPGFYPGASPTEFGTYAGAYYPAQSVQQFPASVAPAPVLMNQPPQIAP 173

Query 181 KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGGLEPQANGETPQVAVIVRPD 240
KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGG LEPQ NGE PQVAVI RPD
Sbjct 174 KRERKTIRIRDPNQGGKDITEEIMSGARTASTPTPPQTGGSLEPQPNGESPQVAVIIRPD 233

Query 241 DRSQGAIIADRPGLPGPEHSP-SESQPSSPSPTPSPSPVLEPGSEPNLAVLSIPGDTMTT 299
DRSQGA I RPGLPGPEHSP ESQPSSPSPTPSP P LEPGSE NL VLSIPGDTMTT
Sbjct 234 DRSQGAAIGGRPGLPGPEHSPGTESQPSSPSPTPSPPPILEPGSESNLGVLSIPGDTMTT 293

Query 300 --IQMSVEESTPISRETGEPYRLSPEPTPLAEPILEVEVTLSKPVPESEFSSSPLQAPTP 357
I SVEESTPIS E GEPY LSPEPT LAEPILEVEVTLSKP PESEFSSSPLQ T
Sbjct 294 GMIPISVEESTPISCESGEPYCLSPEPT-LAEPILEVEVTLSKPIPESEFSSSPLQVSTS 352

Query 358 LASHTVEIHEPNGMVPSEDLEPEVESSPELAPPP--ACPSESPVPIAPTAQPEELLNGAP 415
L H E HEPNG PSEDLEPEVESS E APPP AC SES VPIAPTAQPEELLNGAP
Sbjct 353 LVPHRAETHEPNGVIPSEDLEPEVESSTEPAPPPLSACASESLVPIAPTAQPEELLNGAP 412

Query 416 SPPAVDLSPVSEPEEQAKEV-TASMAPPTIPSATPATAPSATSPAQEEEMEEEEEEEEGE 474
SPPAVDLSPVSEPEEQAKEV A A I S TP APS TS AQEEE EE
Sbjct 413 SPPAVDLSPVSEPEEQAKEVPSAALA--SIVSPTPPVAPSDTSAAQEEEIEED------- 463

Query 475 AGEAGEAESEKGGEELLPPESTPIPANLSQNLEAAAATQVAVSVPKRRRKIKELNKKEAV 534
E GEAESEKGGE L P STP PA LSQNLE AAA QVAVSVPKRRRKIKELNKKEAV
Sbjct 464 EDEDGEAESEKGGEDL-PLDSTPVPAQLSQNLEVAAAPQVAVSVPKRRRKIKELNKKEAV 522

Query 535 GDLLDAFKEANPAVPEVENQPPAGSNPGPESEGSGVPPRPEEADETWDSKEDKIHNAENI 594
GDLLDAFKE PAVPEVENQPP GSNP PESEGS P PEEA ETWDSKEDKIHNAENI
Sbjct 523 GDLLDAFKEVDPAVPEVENQPPTGSNPSPESEGSAALPQPEEAEETWDSKEDKIHNAENI 582

Query 595 QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHISDVVLDKANKT 654
QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHI DVVLDKANKT
Sbjct 583 QPGEQKYEYKSDQWKPLNLEEKKRYDREFLLGFQFIFASMQKPEGLPHITDVVLDKANKT 642

Query 655 PLRPLDPTRLQGINCGPDFTPSFANLGRTTLSTRGPPRGGPGGELPRGPAGLGPRRSQQG 714
PLR LDP RL GINCGPDFTPSFANLGR TLS RGPPRGGPGGELPRGPAGLGPRRSQQG
Sbjct 643 PLRSLDPSRLPGINCGPDFTPSFANLGRPTLSSRGPPRGGPGGELPRGPAGLGPRRSQQG 702

Query 715 PRKEPRKIIATVLMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI 774
PRKE RKII V MTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI
Sbjct 703 PRKETRKIISSVIMTEDIKLNKAEKAWKPSSKRTAADKDRGEEDADGSKTQDLFRRVRSI 762

Query 775 LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV 834
LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV
Sbjct 763 LNKLTPQMFQQLMKQVTQLAIDTEERLKGVIDLIFEKAISEPNFSVAYANMCRCLMALKV 822

Query 835 PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA 894
PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA
Sbjct 823 PTTEKPTVTVNFRKLLLNRCQKEFEKDKDDDEVFEKKQKEMDEAATAEERGRLKEELEEA 882

Query 895 RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD 954
RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD
Sbjct 883 RDIARRRSLGNIKFIGELFKLKMLTEAIMHDCVVKLLKNHDEESLECLCRLLTTIGKDLD 942

Query 955 FEKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRGSNWVPRRGDQGPKTIDQIHK 1014
F KAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLR SNWVPRRGDQGPKTIDQIHK
Sbjct 943 FAKAKPRMDQYFNQMEKIIKEKKTSSRIRFMLQDVLDLRQSNWVPRRGDQGPKTIDQIHK 1002

Query 1015 EAEMEEHREHIKVQQLMAKGSDKRRGGPPGPPISRGLPLVDDGGWNTVPISKGSRPIDTS 1074
EAEMEEHREHIKVQQLMAKG DKRRGGPPGPP V DGGWNTVPISKGSRPIDTS
Sbjct 1003 EAEMEEHREHIKVQQLMAKGGDKRRGGPPGPP-------VNDGGWNTVPISKGSRPIDTS 1055

Query 1075 RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTLNRFSALQ 1134
RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSD ASEA RPA TLNRFSALQ
Sbjct 1056 RLTKITKPGSIDSNNQLFAPGGRLSWGKGSSGGSGAKPSDTASEATRPA--TLNRFSALQ 1113

Query 1135 QAVPTESTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS 1194
Q P E TDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS
Sbjct 1114 QTLPVENTDNRRVVQRSSLSRERGEKAGDRGDRLERSERGGDRGDRLDRARTPATKRSFS 1173

Query 1195 KEVEERSRERPSQPEGLRKAASLTEDRDRGRDAVKREAALPPVSPLKAALSEEELEKKSK 1254
KEVEERSRERPSQPEGLRKAASLTE DRGRD VKREA LPPVSP KAAL E E KSK
Sbjct 1174 KEVEERSRERPSQPEGLRKAASLTE--DRGRDPVKREATLPPVSPPKAALAVDEVERKSK 1231

Query 1255 AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRHGVESTLERSAIAREHMGQLLHQLLCA 1314
AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVR G ESTLERS IAREHMG LLHQLLCA
Sbjct 1232 AIIEEYLHLNDMKEAVQCVQELASPSLLFIFVRLGIESTLERSTIAREHMGRLLHQLLCA 1291

Query 1315 GHLSTAQYYQGLYEILELAEDMEIDIPHVWLYLAELVTPILQEGGVPMGELFREITKPLR 1374
GHLSTAQYYQGLYE LELAEDMEIDIPHVWLYLAEL TPILQE GVPMGELFREITKPLR
Sbjct 1292 GHLSTAQYYQGLYETLELAEDMEIDIPHVWLYLAELITPILQEDGVPMGELFREITKPLR 1351

Query 1375 PLGKAASLLLEILGLLCKSMGPKKVGTLWREAGLSWKEFLPEGQDIGAFVAEQKVEYTLG 1434
P GKA SLLLEILGLLCKSMGPKKVG LWREAGLSW EFL EGQD G FVAE KVEYTLG
Sbjct 1352 PMGKATSLLLEILGLLCKSMGPKKVGMLWREAGLSWREFLAEGQDVGSFVAEKKVEYTLG 1411

Query 1435 EESEAPGQRALPSEELNRQLEKLLKEGSSNQRVFDWIEANLSEQQIVSNTLVRALMTAVC 1494
EESEAPGQRAL EEL RQLEKLLK G SNQRVFDWIEANL EQQI SNTLVRALMT VC
Sbjct 1412 EESEAPGQRALAFEELRRQLEKLLKDGGSNQRVFDWIEANLNEQQIASNTLVRALMTTVC 1471

Query 1495 YSAIIFETPLRVDVAVLKARAKLLQKYLCDEQKELQALYALQALVVTLEQPPNLLRMFFD 1554
YSAIIFETPLRVDV VLK RA LLQKYL DEQKELQALYALQALVVTLEQP NLLRMFFD
Sbjct 1472 YSAIIFETPLRVDVQVLKVRARLLQKYLSDEQKELQALYALQALVVTLEQPANLLRMFFD 1531

Query 1555 ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFKWLREAE-EESDHN 1606
ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFF WLREAE EESDHN
Sbjct 1532 ALYDEDVVKEDAFYSWESSKDPAEQQGKGVALKSVTAFFNWLREAEDEESDHN 1584



That's it,

Pierre

17 November 2010

BLAST/XML+Annotations

I recently asked on Biostar if it would be possible to align two sequences while displaying their respective annotations.

As both answers I received (SPICE and jalview ) require a graphical interface, I quickly wrote a command-line java program doing the job. This program reads a NCBI/BLAST XML output and, if the 'query' or the 'hit' definition lines start with "gi|....", it fetches the Genbank records and the annotations for the sequence and map them onto the alignments.

The program is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BlastAnnotation.java.

Do we need an external library parsing Blast?

No, the java binding compiler, ${JAVA_HOME}/bin/xjc, can generate a java parser from BLAST DTD:
xjc -d src -p sandbox.ncbi.blast -dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/blast/BlastOutput.java
sandbox/ncbi/blast/BlastOutputIterations.java
sandbox/ncbi/blast/BlastOutputMbstat.java
sandbox/ncbi/blast/BlastOutputParam.java
sandbox/ncbi/blast/Hit.java
sandbox/ncbi/blast/HitHsps.java
sandbox/ncbi/blast/Hsp.java
sandbox/ncbi/blast/Iteration.java
sandbox/ncbi/blast/IterationHits.java
sandbox/ncbi/blast/IterationStat.java
sandbox/ncbi/blast/ObjectFactory.java
sandbox/ncbi/blast/Parameters.java
sandbox/ncbi/blast/Statistics.java


And do we need an external library parsing Genbank?

No, again xjc did the job:
xjc -d src -p sandbox.ncbi.gbc -dtd http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/gbc/INSDAltSeqData.java
sandbox/ncbi/gbc/INSDAltSeqDataItems.java
sandbox/ncbi/gbc/INSDAltSeqItem.java
sandbox/ncbi/gbc/INSDAltSeqItemInterval.java
sandbox/ncbi/gbc/INSDAltSeqItemIsgap.java
sandbox/ncbi/gbc/INSDAuthor.java
sandbox/ncbi/gbc/INSDComment.java
sandbox/ncbi/gbc/INSDCommentItem.java
sandbox/ncbi/gbc/INSDCommentParagraph.java
sandbox/ncbi/gbc/INSDCommentParagraphItems.java
sandbox/ncbi/gbc/INSDCommentParagraphs.java
sandbox/ncbi/gbc/INSDFeature.java
sandbox/ncbi/gbc/INSDFeatureIntervals.java
sandbox/ncbi/gbc/INSDFeaturePartial3.java
sandbox/ncbi/gbc/INSDFeaturePartial5.java
sandbox/ncbi/gbc/INSDFeatureQuals.java
sandbox/ncbi/gbc/INSDFeatureSet.java
sandbox/ncbi/gbc/INSDFeatureSetFeatures.java
sandbox/ncbi/gbc/INSDFeatureXrefs.java
sandbox/ncbi/gbc/INSDInterval.java
sandbox/ncbi/gbc/INSDIntervalInterbp.java
sandbox/ncbi/gbc/INSDIntervalIscomp.java
sandbox/ncbi/gbc/INSDKeyword.java
sandbox/ncbi/gbc/INSDQualifier.java
sandbox/ncbi/gbc/INSDReference.java
sandbox/ncbi/gbc/INSDReferenceAuthors.java
sandbox/ncbi/gbc/INSDReferenceXref.java
sandbox/ncbi/gbc/INSDSecondaryAccn.java
sandbox/ncbi/gbc/INSDSeq.java
sandbox/ncbi/gbc/INSDSeqAltSeq.java
sandbox/ncbi/gbc/INSDSeqCommentSet.java
sandbox/ncbi/gbc/INSDSeqFeatureSet.java
sandbox/ncbi/gbc/INSDSeqFeatureTable.java
sandbox/ncbi/gbc/INSDSeqKeywords.java
sandbox/ncbi/gbc/INSDSeqOtherSeqids.java
sandbox/ncbi/gbc/INSDSeqReferences.java
sandbox/ncbi/gbc/INSDSeqSecondaryAccessions.java
sandbox/ncbi/gbc/INSDSeqStrucComments.java
sandbox/ncbi/gbc/INSDSeqid.java
sandbox/ncbi/gbc/INSDSet.java
sandbox/ncbi/gbc/INSDStrucComment.java
sandbox/ncbi/gbc/INSDStrucCommentItem.java
sandbox/ncbi/gbc/INSDStrucCommentItems.java
sandbox/ncbi/gbc/INSDXref.java
sandbox/ncbi/gbc/ObjectFactory.java

Example


As an example I've aligned the "human eif4G1" (gi|303227906) with "Mus musculus eif4G1" (gi|56699433).
The very first lines of the BLAST report are:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.24+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch
<BlastOutput_db>n/a</BlastOutput_db>
<BlastOutput_query-ID>gi|303227906|ref|NM_198241.2|</BlastOutput_query-ID>
<BlastOutput_query-def>Homo sapiens eukaryotic translation initiation factor 4
<BlastOutput_query-len>5538</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>2</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|303227906|ref|NM_198241.2|</Iteration_query-ID>
<Iteration_query-def>Homo sapiens eukaryotic translation initiation factor 4 g
<Iteration_query-len>5538</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|56699433|ref|NM_001005331.1|</Hit_id>
<Hit_def>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif
<Hit_accession>NM_001005331</Hit_accession>
<Hit_len>5460</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>6818.02</Hsp_bit-score>
<Hsp_score>7560</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>53</Hsp_query-from>
<Hsp_query-to>5538</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5418</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>4820</Hsp_identity>
<Hsp_positive>4820</Hsp_positive>
<Hsp_gaps>138</Hsp_gaps>
<Hsp_align-len>5521</Hsp_align-len>
<Hsp_qseq>GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_hseq>GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_midline>||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||...
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And here is the output of my program:

java -jar dist/blastannot.jar ~/jeter.blast.xml

QUERY: Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 2, mRNA
ID:gi|303227906|ref|NM_198241.2| Len:5538
>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif4g1), transcript variant 2, mRNA
NM_001005331
id:gi|56699433|ref|NM_001005331.1| len:5460

e-value:0 gap:138 bitScore:6818.02

#####:############################################ exon 1..180 gene:EIF4G1
QUERY 000000053 GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000102
||||| ||||||||||||||||||||||||||||||||||||||||||||
HIT 000000001 GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000050
#####:############################################ exon 1..128 gene:Eif4g1



################################################## exon 1..180 gene:EIF4G1
QUERY 000000103 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000152
||||||||||||||||||||||||||||||||||||||||||||||||||
HIT 000000051 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000100
################################################## exon 1..128 gene:Eif4g1



############::::::::::###### exon 1..180 gene:EIF4G1
#:::::::::::::###::::: exon 181..237 gene:EIF4G1
QUERY 000000153 CGGCCGCGCGGACTGAAGGAGACTGAAGGCCCTCGGATGCCCAGAACCTG 000000202
|||||||||||| ||||||| |||
HIT 000000101 CGGCCGCGCGGA----------CTGAAGG-------------AGA----- 000000122
############----------#######-------------### gene 1..5460 gene:Eif4g1
############----------#######-------------### exon 1..128 gene:Eif4g1



::::::::::::::::::::::##:##:::::::# exon 181..237 gene:EIF4G1
############### exon 238..331 gene:EIF4G1
QUERY 000000203 TAGGCCGCACCGTGGACTTGTTCTTAATCGAGGGGGTGCTGGGGGGACCC 000000252
|| || ||||||||||||||||
HIT 000000123 ----------------------CTGAA-------GGTGCTGGGGGGACCC 000000143
----------------------##:##-------# exon 1..128 gene:Eif4g1
############### exon 129..222 gene:Eif4g1



#:###############################:###:############ exon 238..331 gene:EIF4G1
##############:###:############ CDS 272..5071 gene:EIF4G1
QUERY 000000253 TGATGTGGCACCAAATGAAATGAACAAAGCTCCACAGTCCACAGGCCCCC 000000302
| ||||||||||||||||||||||||||||||| ||| ||||||||||||
HIT 000000144 TAATGTGGCACCAAATGAAATGAACAAAGCTCCCCAGCCCACAGGCCCCC 000000193
#:###############################:###:############ exon 129..222 gene:Eif4g1
##############:###:############ CDS 163..4944 gene:Eif4g1


(...)


############:#:#:#####:######:#:########:##:###### exon 4890..5521 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 4948..5505 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 5174..5403 gene:EIF4G1
QUERY 000005319 TTGGTGTGTCTTGGGGTGGGGAGGGGCACCAACGCCTGCCCCTGGGGTCC 000005368
|||||||||||| | | ||||| |||||| | |||||||| || ||||||
HIT 000005201 TTGGTGTGTCTTTGCGGGGGGAAGGGCACTACCGCCTGCCTCTAGGGTCC 000005250
############:#:#:#####:######:#:########:##:###### exon 4760..5396 gene:Eif4g1



::##############:##########:###################### exon 4890..5521 gene:EIF4G1
::##############:##########:###################### STS 4948..5505 gene:EIF4G1
::##############:##########:####### STS 5174..5403 gene:EIF4G1
QUERY 000005369 TTTTTTTTATTTTCTGAAAATCACTCTCGGGACTGCCGTCCTCGCTGCTG 000005418
|||||||||||||| |||||||||| ||||||||||||||||||||||
HIT 000005251 --TTTTTTATTTTCTG-AAATCACTCTTGGGACTGCCGTCCTCGCTGCTG 000005297
--##############-##########:###################### exon 4760..5396 gene:Eif4g1



######################:#############:############# exon 4890..5521 gene:EIF4G1
######################:#############:############# STS 4948..5505 gene:EIF4G1
QUERY 000005419 GGGGCATATGCCCCAGCCCCTGTACCACCCCTGCTGTTGCCTGGGCAGGG 000005468
|||||||||||||||||||||| ||||||||||||| |||||||||||||
HIT 000005298 GGGGCATATGCCCCAGCCCCTGCACCACCCCTGCTGCTGCCTGGGCAGGG 000005347
######################:#############:############# exon 4760..5396 gene:Eif4g1



#:##-############################################: exon 4890..5521 gene:EIF4G1
#:##-################################# STS 4948..5505 gene:EIF4G1
###### polyA_signal 5496..5501 gene:EIF4G1
# polyA_site 5516 gene:EIF4G1
QUERY 000005469 GGAA-GGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAG 000005517
| || ||||||||||||||||||||||||||||||||||||||||||||
HIT 000005348 GAAAGGGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAA 000005397
#:##:############################################ exon 4760..5396 gene:Eif4g1



:::# exon 4890..5521 gene:EIF4G1
# polyA_site 5521 gene:EIF4G1
QUERY 000005518 CTCAAAAAAAAAAAAAAAAAA 000005538
||||||||||||||||||
HIT 000005398 AAAAAAAAAAAAAAAAAAAAA 000005418



That's it,
Pierre

04 September 2009

My Notes about Bowtie


Image via wikipedia

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie is open source http://bowtie.cbcb.umd.edu (Langmead B, & al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biol. 2009;10(3):R25.)

Building an indexed Database


The fasta sequences retrieved from the query
"ROTAVIRUS[organism] VP1 Human"
were saved to the file ./genomes/rotavirus.fna.
The file was then indexed with ./bowtie-build
./bowtie-build genomes/rotavirus.fna indexes/rotavirus

Settings:
Output files: "rotavirus.*.ebwt"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 5 (one in 32)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Reference base cutoff: none
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:4, int:4, long:4, size_t:4
Input files DNA, FASTA:
genomes/rotavirus.fna
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
= 316434 (0 characters of padding)
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 79108
Using parameters --bmax 59331 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 59331 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 6; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 0; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 39553.4 (target: 59330)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 38593
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 38594
Getting block 2 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 53650
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 53651
Getting block 3 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 45711
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 45712
Getting block 4 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 18201
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 18202
Getting block 5 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 56112
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 56113
Getting block 6 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 24508
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 24509
Getting block 7 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 52958
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 52959
Getting block 8 of 8
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 26694
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 26695
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 117471
fchr[G]: 165360
fchr[T]: 220740
fchr[$]: 316434
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4302635 bytes to primary EBWT file: rotavirus.1.ebwt
Wrote 39560 bytes to secondary EBWT file: rotavirus.2.ebwt
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 316434
bwtLen: 316435
sz: 79109
bwtSz: 79109
lineRate: 6
linesPerSide: 1
offRate: 5
offMask: 0xffffffe0
isaRate: -1
isaMask: 0xffffffff
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 9889
offsSz: 39556
isaLen: 0
isaSz: 0
lineSz: 64
sideSz: 64
sideBwtSz: 56
sideBwtLen: 224
numSidePairs: 707
numSides: 1414
numLines: 1414
ebwtTotLen: 90496
ebwtTotSz: 90496
Total time for call to driver() for forward index: 00:00:01
Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
= 316434 (0 characters of padding)
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 79108
Using parameters --bmax 59331 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 59331 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning
Building samples
Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples
(Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Split 1, merged 7; iterating...
Binary sorting into buckets
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Binary sorting into buckets time: 00:00:00
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 45204 (target: 59330)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering Ebwt loop
Getting block 1 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 55484
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 55485
Getting block 2 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 58214
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 58215
Getting block 3 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 58224
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 58225
Getting block 4 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 33666
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 33667
Getting block 5 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 48159
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 48160
Getting block 6 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 45260
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 45261
Getting block 7 of 7
Reserving size (59331) for bucket
Calculating Z arrays
Calculating Z arrays time: 00:00:00
Entering block accumulator loop:
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Block accumulator loop time: 00:00:00
Sorting block of length 17421
(Using difference cover)
Sorting block time: 00:00:00
Returning block of 17422
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 117471
fchr[G]: 165360
fchr[T]: 220740
fchr[$]: 316434
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 4302635 bytes to primary EBWT file: rotavirus.rev.1.ebwt
Wrote 39560 bytes to secondary EBWT file: rotavirus.rev.2.ebwt
Re-opening _in1 and _in2 as input streams
Returning from Ebwt constructor
Headers:
len: 316434
bwtLen: 316435
sz: 79109
bwtSz: 79109
lineRate: 6
linesPerSide: 1
offRate: 5
offMask: 0xffffffe0
isaRate: -1
isaMask: 0xffffffff
ftabChars: 10
eftabLen: 20
eftabSz: 80
ftabLen: 1048577
ftabSz: 4194308
offsLen: 9889
offsSz: 39556
isaLen: 0
isaSz: 0
lineSz: 64
sideSz: 64
sideBwtSz: 56
sideBwtLen: 224
numSidePairs: 707
numSides: 1414
numLines: 1414
ebwtTotLen: 90496
ebwtTotSz: 90496
Total time for backward call to driver() for mirror index: 00:00:00


Searching


The short sequences found with the query "NCBI: ROTAVIRUS[organism] 10:150[SLEN] bovine" were downloaded to ./reads/short-rota.fasta.
Those sequences were then mapped on the indexed sequences:
./bowtie -f rotavirus reads/short-rota.fasta | java -jar ~/lindenb/build/verticalize.jar -n
Reported 14 alignments to 1 output stream(s)
>>1
$1 ? : gi|62781718|gb|AR643436.1| Sequence 6 from patent US 6867021
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7504
$5 ? : GTGGCTTCCATTAGAAGCATG
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 1
<<1
>>2
$1 ? : gi|62781717|gb|AR643435.1| Sequence 5 from patent US 6867021
$2 ? : +
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7230
$5 ? : ACCACCAAATATGACACCAGC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 1
<<2
>>3
$1 ? : gi|56638523|gb|AR590758.1| Sequence 32 from patent US 6805867
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 5633
$5 ? : GGATGGCCAACAGGATC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
$8 ? : 2:T>A,5:T>A
<<3
>>4
$1 ? : gi|56638522|gb|AR590757.1| Sequence 31 from patent US 6805867
$2 ? : +
$3 ? : gi|5706619|gb|AF106283.1|AF106283 Human rotavirus G2 strain TA25 VP7 protein (VP7) gene, complete cds
$4 ? : 22
$5 ? : GTATGGTATTGAATATACCAC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 16
<<4
>>5
$1 ? : gi|56638509|gb|AR590744.1| Sequence 18 from patent US 6805867
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7658
$5 ? : TAGTGAGAGGATGTGACC
$6 ? : IIIIIIIIIIIIIIIIII
$7 ? : 1
<<5
>>6
$1 ? : gi|56638508|gb|AR590743.1| Sequence 17 from patent US 6805867
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 6320
$5 ? : GGCTTTTAAACGAAGTC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
<<6
>>7
$1 ? : gi|56638505|gb|AR590740.1| Sequence 14 from patent US 6805867
$2 ? : -
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 15298
$5 ? : TGTGGAGATATGACC
$6 ? : IIIIIIIIIIIIIII
$7 ? : 1
<<7
>>8
$1 ? : gi|56638504|gb|AR590739.1| Sequence 13 from patent US 6805867
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 12626
$5 ? : GGCTATTAAAGGT
$6 ? : IIIIIIIIIIIII
$7 ? : 1
$8 ? : 12:C>T
<<8
>>9
$1 ? : gi|10003339|gb|AR076593.1|AR076593 Sequence 32 from patent US 5959093
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 5633
$5 ? : GGATGGCCAACAGGATC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
$8 ? : 2:T>A,5:T>A
<<9
>>10
$1 ? : gi|10003338|gb|AR076592.1|AR076592 Sequence 31 from patent US 5959093
$2 ? : +
$3 ? : gi|5706619|gb|AF106283.1|AF106283 Human rotavirus G2 strain TA25 VP7 protein (VP7) gene, complete cds
$4 ? : 22
$5 ? : GTATGGTATTGAATATACCAC
$6 ? : IIIIIIIIIIIIIIIIIIIII
$7 ? : 16
<<10
>>11
$1 ? : gi|10003325|gb|AR076579.1|AR076579 Sequence 18 from patent US 5959093
$2 ? : -
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 7658
$5 ? : TAGTGAGAGGATGTGACC
$6 ? : IIIIIIIIIIIIIIIIII
$7 ? : 1
<<11
>>12
$1 ? : gi|10003324|gb|AR076578.1|AR076578 Sequence 17 from patent US 5959093
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 6320
$5 ? : GGCTTTTAAACGAAGTC
$6 ? : IIIIIIIIIIIIIIIII
$7 ? : 1
<<12
>>13
$1 ? : gi|10003321|gb|AR076575.1|AR076575 Sequence 14 from patent US 5959093
$2 ? : -
$3 ? : gi|66774335|gb|AH014892.1|SEG_DQ005104S Human rotavirus A strain DRC88 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 15298
$5 ? : TGTGGAGATATGACC
$6 ? : IIIIIIIIIIIIIII
$7 ? : 1
<<13
>>14
$1 ? : gi|10003320|gb|AR076574.1|AR076574 Sequence 13 from patent US 5959093
$2 ? : +
$3 ? : gi|78064499|gb|AH014893.1|SEG_DQ005115S Human rotavirus A strain DRC86 NSP5, NSP4, NSP3, NSP2, NSP1, VP7, VP6, VP4, VP3, VP2, and VP1 genes, complete cds
$4 ? : 12626
$5 ? : GGCTATTAAAGGT
$6 ? : IIIIIIIIIIIII
$7 ? : 1
$8 ? : 12:C>T
<<14
(the verticalize is a small utility I wrote, it transforms a horizontal output to a vertical one).


In the 9th result says : the reverse sequence of "gi|10003339" is GGATGGCCAACAGGATC . This sequence matches "gi|78064499" at position '5633' there are two mismatches:2:T>A,5:T>A. Control: running a blas2seq on those two sequences gives the following result:


>gb|AR076593.1|AR076593 Sequence 32 from patent US 5959093
Length=17

Score = 22.9 bits (24), Expect = 0.014
Identities = 15/17 (88%), Gaps = 0/17 (0%)
Strand=Plus/Minus


Query 5634 GGATGGCCAACTGGTTC 5650
||||||||||| || ||
Sbjct 17 GGATGGCCAACAGGATC 1


That's it
Pierre

27 May 2008

NCBI Blast+ XSLT => XHTML + SVG

This post was inspired by the article Processing and duplicons on human chromosomes sent by Paulo Nuin yesterday and the short discussion that followed on FriendFeed. Paulo described in this article how the processing tool was used to display an output of ncbi-blast.

Here I show how a XSLT stylesheet can be used to transform a Blast into a XHTML+SVG page.

The stylesheet described here is available [here]



Here is how it works, at the beginning we've got a blast output in XML (here the query was a murine histone vs the human genome)

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>blastn 2.2.5 [Nov-16-2002]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jin
ghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new
generation of protein database search~programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_r
eference>
<BlastOutput_db>HumanGenome</BlastOutput_db>
<BlastOutput_query-ID>lcl|QUERY</BlastOutput_query-ID>
<BlastOutput_query-def>gi|34556456|gb|AY158922.2| Mus musculus histone protein Hist2h2ab gene, complet
e cds</BlastOutput_query-def>
<BlastOutput_query-len>1680</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>1</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>D</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|32</Hit_id>
<Hit_def>chr6</Hit_def>
<Hit_accession>32</Hit_accession>
<Hit_len>170975699</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>444.541</Hsp_bit-score>
<Hsp_score>224</Hsp_score>
<Hsp_evalue>7.76885e-122</Hsp_evalue>
<Hsp_query-from>209</Hsp_query-from>
<Hsp_query-to>580</Hsp_query-to>
<Hsp_hit-from>27883967</Hsp_hit-from>
<Hsp_hit-to>27884338</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>335</Hsp_identity>
<Hsp_positive>335</Hsp_positive>
<Hsp_align-len>372</Hsp_align-len>
<Hsp_qseq>ATGTCTGGCCGTGGCAAACAGGGAGGCAAGGCCCGCGCCAAGGCCAAGTCGCGGTCTTCCCGGGCCGGGCTACAGTTCCC
GGTGGGGCGTGTGCACCGGCTGCTGCGCAAGGGCAACTACGCGGAGCGCGTGGGTGCCGGCGCGCCGGTATACATGGCGGCGGTGCTGGAGTACCTAACGGCCGAGATCC
TGGAGCTGGCGGGCAACGCGGCCCGCGACAACAAGAAGACGCGCATCATCCCGCGCCACCTGCAGCTGGCCATCCGCAACGACGAGGAGCTCAACAAGCTGCTGGGCAAA
GTGACGATCGCACAGGGCGGCGTCCTGCCCAACATCCAGGCCGTGCTGCTGCCCAAGAAGACCGAGAGCCAC</Hsp_qseq>
<Hsp_hseq>ATGTCTGGGCGTGGCAAGCAGGGAGGCAAAGCTCGCGCCAAGGCCAAGACCCGCTCTTCTCGGGCCGGGCTTCAGTTTCC
CGTAGGCCGAGTGCATCGCCTGCTCCGCAAAGGCAACTATGCGGAGCGGGTCGGTGCTGGAGCGCCGGTGTACCTGGCGGCGGTGCTGGAGTACCTGACCGCCGAGATCC
TGGAGCTGGCTGGCAACGCGGCCCGCGACAACAAGAAGACTCGCATCATCCCGCGTCACCTCCAGCTGGCCATCCGCAACGATGAGGAGCTCAACAAGCTTCTGGGCAAA
GTCACCATCGCACAGGGTGGCGTCCTGCCCAACATCCAGGCCGTGCTACTGCCCAAGAAGACCGAGAGCCAC</Hsp_hseq>
<Hsp_midline>|||||||| |||||||| ||||||||||| || ||||||||||||||| | || ||||| ||||||||||| |||||
|| || || || ||||| || ||||| ||||| |||||||| |||||||| || ||||| || |||||||| ||| |||||||||||||||||||||| || |||||||
||||||||||||| ||||||||||||||||||||||||||||| |||||||||||||| ||||| |||||||||||||||||||| ||||||||||||||||| ||||||
||||| || ||||||||||| ||||||||||||||||||||||||||||| ||||||||||||||||||||||||</Hsp_midline>
</Hsp>
<Hsp>
<Hsp_num>2</Hsp_num>
<Hsp_bit-score>420.753</Hsp_bit-score>
<Hsp_score>212</Hsp_score>
<Hsp_evalue>1.1255e-114</Hsp_evalue>
<Hsp_query-from>580</Hsp_query-from>
<Hsp_query-to>209</Hsp_query-to>
<Hsp_hit-from>27890126</Hsp_hit-from>
<Hsp_hit-to>27890497</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>-1</Hsp_hit-frame>
<Hsp_identity>332</Hsp_identity>
<Hsp_positive>332</Hsp_positive>
<Hsp_align-len>372</Hsp_align-len>
<Hsp_qseq>GTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCGCCCTGTGCGATCGTCACTTTGCCCA
GCAGCTTGTTGAGCTCCTCGTCGTTGCGGATGGCCAGCTGCAGGTGGCGCGGGATGATGCGCGTCTTCTTGTTGTCGCGGGCCGCGTTGCCCGCCAGCTCCAGGATCTCG
GCCGTTAGGTACTCCAGCACCGCCGCCATGTATACCGGCGCGCCGGCACCCACGCGCTCCGCGTAGTTGCCCTTGCGCAGCAGCCGGTGCACACGCCCCACCGGGAACTG
TAGCCCGGCCCGGGAAGACCGCGACTTGGCCTTGGCGCGGGCCTTGCCTCCCTGTTTGCCACGGCCAGACAT</Hsp_qseq>
<Hsp_hseq>GTGGCTCTCAGTTTTCTTTGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCACCCTGTGCGATGGTGACTTTGCCCA
GAAGCTTGTTGAGCTCCTCATCGTTGCGGATGGCCAGCTGGAGGTGACGCGGGATGATGCGAGTCTTCTTGTTGTCGCGGGCCGCGTTGCCAGCCAGCTCCAGGATCTCG
GCGGTCAGGTACTCCAGCACCGCCGCCAGGTACACCGGCGCTCCAGCACCGACCCGCTCCGCATAGTTGCCTTTGCGGAGCAGGCGATGCACTCGGCCTACGGGAAACTG
AAGCCCGGCCCGAGAAGAGCGGGTCTTGGCCTTGGCGCGAGCTTTGCCTCCCTGCTTACCACGCCCAGACAT</Hsp_hseq>
<Hsp_midline>||||||||| || ||||| ||||||||||||||||||||||||||||||||||| ||||||||||| || |||||||
|||| ||||||||||||||||| |||||||||||||||||||| ||||| |||||||||||||| ||||||||||||||||||||||||||||| |||||||||||||||
||||| || |||||||||||||||||||||| ||| |||||||| || ||||| || |||||||| |||||||| ||||| ||||| || ||||| || || || || ||
||| ||||||||||| ||||| || | ||||||||||||||| || ||||||||||| || ||||| ||||||||</Hsp_midline>
</Hsp>

(...)

</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>1</Statistics_db-num>
<Statistics_db-len>245522847</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>5.13463e+12</Statistics_eff-space>
<Statistics_kappa>0.710605</Statistics_kappa>
<Statistics_lambda>1.37407</Statistics_lambda>
<Statistics_entropy>1.30725</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>


And the XSLT stylesheet:
1 <?xml version="1.0" encoding="UTF-8"?>
2 <xsl:stylesheet
3 version="1.0"
4 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
5 xmlns:svg="http://www.w3.org/2000/svg"
6 xmlns:xlink="http://www.w3.org/1999/xlink"
7 xmlns:h="http://www.w3.org/1999/xhtml"
8 >


17 <!-- ========================================================================= -->
18 <xsl:output method='xml' indent='yes' omit-xml-declaration="no"/>
19 <!-- we preserve the spaces in that element -->
20 <xsl:preserve-space elements="svg:style h:style" />
21
22 <!-- ========================================================================= -->
23 <!-- the width of the SVG -->
24 <xsl:variable name="svg-width">800</xsl:variable>
25 <!-- height of a HSP -->
26 <xsl:variable name="hsp-height">10</xsl:variable>
27 <!-- total number of Hits in first blast iteration -->
28 <xsl:variable name="hit-count"><xsl:value-of select="count(BlastOutput/BlastOutput_iterations/Iteration[1]/Iteration_hits/Hit)"/></xsl:variable>
29 <!-- total number of HSP in first blast iteration -->
30 <xsl:variable name="hsp-count"><xsl:value-of select="count(BlastOutput/BlastOutput_iterations/Iteration[1]/Iteration_hits/Hit/Hit_hsps/Hsp)"/></xsl:variable>
31 <!-- query length (bases or amino acids ) -->
32 <xsl:variable name="query-length"><xsl:value-of select="BlastOutput/BlastOutput_query-len"/></xsl:variable>
33 <!-- margin between two hits -->
34 <xsl:variable name="space-between-hits"><xsl:value-of select="3* $hsp-height"/></xsl:variable>
35 <!-- height of all hits -->
36 <xsl:variable name="hits-height"><xsl:value-of select="$hsp-count * $hsp-height + ($hit-count + 1) * $space-between-hits"/></xsl:variable>
37 <!-- size of the top header -->
38 <xsl:variable name="header-height">50</xsl:variable>
39
40 <!-- ========================================================================= -->
41
42 <!-- matching the root node -->
43 <xsl:template match="/">
44 <!-- start XHTML -->
45 <h:html>
46 <h:head>
47 <h:style type="text/css">
48 body {
49 font-size:10px;
50 font-family:Helvetica;
51 background-color:rgb(150,150,150);
52 color:white;
53 }
54 </h:style>
55 <h:title><xsl:value-of select="BlastOutput/BlastOutput_query-def"/></h:title>
56 </h:head>
57 <h:body>
58 <h:h1>Blast Results</h:h1>
59 <h:div>
60 <h:h3>Parameters</h:h3>
61 <h:table>
62 <h:tr><h:th>Database</h:th><h:td><xsl:value-of select="BlastOutput/BlastOutput_db"/></h:td></h:tr>
63 <h:tr><h:th>Query ID</h:th><h:td><xsl:value-of select="BlastOutput/BlastOutput_query-ID"/></h:td></h:tr>
64 <h:tr><h:th>Query Def.</h:th><h:td><h:b><xsl:value-of select="BlastOutput/BlastOutput_query-def"/></h:b></h:td></h:tr>
65 <h:tr><h:th>Query Length</h:th><h:td><h:b><xsl:value-of select="BlastOutput/BlastOutput_query-len"/></h:b></h:td></h:tr>
66 <h:tr><h:th>Version</h:th><h:td><xsl:value-of select="BlastOutput/BlastOutput_version"/></h:td></h:tr>
67 <h:tr><h:th>Reference</h:th><h:td><h:a href="http://www.ncbi.nlm.nih.gov/pubmed/9254694"><xsl:value-of select="BlastOutput/BlastOutput_reference"/></h:a></h:td></h:tr>
68 </h:table>
69 </h:div>
70 <h:hr/>
71 <h:div style="text-align:center">
72
73 <!-- starts SVG figure -->
74 <xsl:element name="svg:svg">
75 <xsl:attribute name="version">1.0</xsl:attribute>
76 <xsl:attribute name="width"><xsl:value-of select="$svg-width"/></xsl:attribute>
77 <xsl:attribute name="height"><xsl:value-of select="$hits-height + $header-height "/></xsl:attribute>
78 <svg:title><xsl:value-of select="BlastOutput/BlastOutput_query-def"/></svg:title>
79 <svg:defs>
80 <svg:style type="text/css">
81 text.t1 {
82 fill:black;
83 font-size:<xsl:value-of select="$hsp-height - 2"/>px;
84 font-family:Helvetica;
85 }
86 text.t2 {
87 fill:blue;
88 font-size:<xsl:value-of select="$space-between-hits - 2"/>px;
89 font-family:Helvetica;
90 text-anchor:middle;
91 }
92 text.title {
93 fill:white;
94 stroke:black;
95 font-size:12px;
96 font-family:Helvetica;
97 text-anchor:middle;
98 alignment-baseline:middle;
99 }
100 line.grid {
101 stroke:lightgray;
102 stroke-width:1.5px;
103 }
104
105 rect.hit {
106 fill:none;
107 stroke:darkgray;
108 stroke-width:1px;
109 }
110 </svg:style>
111
112 <svg:linearGradient x1="0%" y1="0%" x2="0%" y2="100%" id="score1">
113 <svg:stop offset="5%" stop-color="red" />
114 <svg:stop offset="50%" stop-color="whitesmoke" />
115 <svg:stop offset="95%" stop-color="red" />
116 </svg:linearGradient>
117 <svg:linearGradient x1="0%" y1="0%" x2="0%" y2="100%" id="score2">
118 <svg:stop offset="5%" stop-color="orange" />
119 <svg:stop offset="50%" stop-color="whitesmoke" />
120 <svg:stop offset="95%" stop-color="orange" />
121 </svg:linearGradient>
122 <svg:linearGradient x1="0%" y1="0%" x2="0%" y2="100%" id="score3">
123 <svg:stop offset="5%" stop-color="green" />
124 <svg:stop offset="50%" stop-color="whitesmoke" />
125 <svg:stop offset="95%" stop-color="green" />
126 </svg:linearGradient>
127 <svg:linearGradient x1="0%" y1="0%" x2="0%" y2="100%" id="score4">
128 <svg:stop offset="5%" stop-color="blue" />
129 <svg:stop offset="50%" stop-color="whitesmoke" />
130 <svg:stop offset="95%" stop-color="blue" />
131 </svg:linearGradient>
132 <svg:linearGradient x1="0%" y1="0%" x2="0%" y2="100%" id="score5">
133 <svg:stop offset="5%" stop-color="black" />
134 <svg:stop offset="50%" stop-color="whitesmoke" />
135 <svg:stop offset="95%" stop-color="black" />
136 </svg:linearGradient>
137 </svg:defs>
138
139 <xsl:element name="svg:rect">
140 <xsl:attribute name="x">0</xsl:attribute>
141 <xsl:attribute name="y">0</xsl:attribute>
142 <xsl:attribute name="width"><xsl:value-of select="$svg-width - 1"/></xsl:attribute>
143 <xsl:attribute name="height"><xsl:value-of select="$hits-height + $header-height "/></xsl:attribute>
144 <xsl:attribute name="fill">whitesmoke</xsl:attribute>
145 <xsl:attribute name="stroke">blue</xsl:attribute>
146 </xsl:element>
147
148
149 <xsl:apply-templates select="BlastOutput"/>
150 </xsl:element>
151 <!-- end SVG figure -->
152
153 </h:div>
154 <h:hr/>
155 <xsl:apply-templates select="BlastOutput/BlastOutput_param/Parameters"/>
156 <h:hr/>
157 <h:p><h:b>SVG</h:b> figure generated with <h:a href="http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/blast2svg.xsl">blast2svg</h:a>. <h:a href="http://plindenbaum.blogspot.com">Pierre Lindenbaum PhD</h:a> <h:i>( plindenbaum at yahoo dot fr )</h:i></h:p>
158
159 </h:body>
160 </h:html>
161 </xsl:template>
162 <!-- ========================================================================= -->
163 <!-- display parameters in a HTML table -->
164 <xsl:template match="Parameters">
165 <h:div>
166 <h:h3>Parameters</h:h3>
167 <h:table>
168 <h:tr><h:th>Expect</h:th><h:td><xsl:value-of select="Parameters_expect"/></h:td></h:tr>
169 <h:tr><h:th>Sc-match</h:th><h:td><xsl:value-of select="Parameters_sc-match"/></h:td></h:tr>
170 <h:tr><h:th>Sc-mismatch</h:th><h:td><xsl:value-of select="Parameters_sc-mismatch"/></h:td></h:tr>
171 <h:tr><h:th>Gap-open</h:th><h:td><xsl:value-of select="Parameters_gap-open"/></h:td></h:tr>
172 <h:tr><h:th>Gap-extend</h:th><h:td><xsl:value-of select="Parameters_gap-extend"/></h:td></h:tr>
173 <h:tr><h:th>Filter</h:th><h:td><xsl:value-of select="Parameters_filter"/></h:td></h:tr>
174 </h:table>
175 </h:div>
176 </xsl:template>
177
178
179 <!-- ========================================================================= -->
180 <xsl:template match="BlastOutput">
181 <!-- paint header -->
182 <svg:g>
183 <xsl:element name="svg:rect">
184 <xsl:attribute name="x">0</xsl:attribute>
185 <xsl:attribute name="y">0</xsl:attribute>
186 <xsl:attribute name="width"><xsl:value-of select="$svg-width - 1"/></xsl:attribute>
187 <xsl:attribute name="height"><xsl:value-of select="$header-height - 2"/></xsl:attribute>
188 <xsl:attribute name="fill">url(#score5)</xsl:attribute>
189 <xsl:attribute name="stroke">black</xsl:attribute>
190 </xsl:element>
191
192 <xsl:element name="svg:text">
193 <xsl:attribute name="x"><xsl:value-of select="$svg-width div 2"/></xsl:attribute>
194 <xsl:attribute name="y"><xsl:value-of select="$header-height div 2"/></xsl:attribute>
195 <xsl:attribute name="class">title</xsl:attribute>
196 <xsl:value-of select="BlastOutput_query-def"/> (len=<xsl:value-of select="BlastOutput_query-len"/> )
197 </xsl:element>
198 </svg:g>
199 <xsl:apply-templates select="BlastOutput_iterations/Iteration[1]/Iteration_hits"/>
200 </xsl:template>
201
202 <!-- ========================================================================= -->
203
204 <xsl:template match="Iteration_hits">
205 <xsl:apply-templates select="Hit"/>
206 </xsl:template>
207
208 <!-- ========================================================================= -->
209
210 <xsl:template match="Hit">
211 <!-- count number of preceding hits -->
212 <xsl:variable name="preceding-hits"><xsl:value-of select="count(preceding-sibling::Hit)"/></xsl:variable>
213 <!-- count number of preceding hsp -->
214 <xsl:variable name="preceding-hsp"><xsl:value-of select="count(preceding-sibling::Hit/Hit_hsps/Hsp)"/></xsl:variable>
215 <!-- calculate hieght of this part -->
216 <xsl:variable name="height"><xsl:value-of select="count(Hit_hsps/Hsp)*$hsp-height"/></xsl:variable>
217 <!-- translate this part verticaly -->
218 <xsl:element name="svg:g">
219 <xsl:attribute name="transform">translate(0,<xsl:value-of select="$header-height + $preceding-hsp * $hsp-height + ($preceding-hits + 1) * $space-between-hits "/>)</xsl:attribute>
220 <xsl:attribute name="id">hit-<xsl:value-of select="generate-id(.)"/></xsl:attribute>
221 <xsl:element name="svg:text">
222 <xsl:attribute name="x"><xsl:value-of select="$svg-width div 2"/></xsl:attribute>
223 <xsl:attribute name="y"><xsl:value-of select="-2"/></xsl:attribute>
224 <xsl:attribute name="class">t2</xsl:attribute>
225 <xsl:value-of select="Hit_def"/>
226 </xsl:element>
227 <xsl:element name="svg:rect">
228 <xsl:attribute name="x">0</xsl:attribute>
229 <xsl:attribute name="y">0</xsl:attribute>
230 <xsl:attribute name="width"><xsl:value-of select="$svg-width"/></xsl:attribute>
231 <xsl:attribute name="height"><xsl:value-of select="$height"/></xsl:attribute>
232 <xsl:attribute name="class">hit</xsl:attribute>
233 </xsl:element>
234
235 <xsl:call-template name="grid">
236 <xsl:with-param name="x" select="0"/>
237 <xsl:with-param name="d" select="20"/>
238 <xsl:with-param name="W" select="$svg-width"/>
239 <xsl:with-param name="H" select="$height"/>
240 </xsl:call-template>
241
242 <xsl:apply-templates select="Hit_hsps"/>
243
244
245
246 </xsl:element>
247 </xsl:template>
248
249 <!-- ========================================================================= -->
250 <!-- draw vertical lines , recursive template -->
251 <xsl:template name="grid">
252 <xsl:param name="x" select="0" />
253 <xsl:param name="d" select="20" />
254 <xsl:param name="W" select="0" />
255 <xsl:param name="H" select="0" />
256 <svg:line class="grid" x1="{$x}" x2="{$x}" y1="0" y2="{$H}"/>
257 <xsl:if test="$d + $x &lt; $W">
258 <xsl:call-template name="grid">
259 <xsl:with-param name="x" select="$d + $x"/>
260 <xsl:with-param name="d" select="$d"/>
261 <xsl:with-param name="W" select="$W"/>
262 <xsl:with-param name="H" select="$H"/>
263 </xsl:call-template>
264 </xsl:if>
265 </xsl:template>
266
267 <!-- ========================================================================= -->
268
269 <xsl:template match="Hit_hsps">
270 <xsl:apply-templates select="Hsp"/>
271 </xsl:template>
272
273
274 <!-- ========================================================================= -->
275 <xsl:template match="Hsp">
276 <!-- number of previous hsp in the same Hit -->
277 <xsl:variable name="preceding-hsp"><xsl:value-of select="count(preceding-sibling::Hsp)"/></xsl:variable>
278 <!-- get the 5' position of the hsp in the query -->
279 <xsl:variable name="hsp-left"><xsl:choose>
280 <xsl:when test="Hsp_query-from &lt; Hsp_query-to"><xsl:value-of select="Hsp_query-from"/></xsl:when>
281 <xsl:otherwise><xsl:value-of select="Hsp_query-to"/></xsl:otherwise>
282 </xsl:choose></xsl:variable>
283 <!-- get the 3' position of the hsp in the query -->
284 <xsl:variable name="hsp-right"><xsl:choose>
285 <xsl:when test="Hsp_query-from &lt; Hsp_query-to"><xsl:value-of select="Hsp_query-to"/></xsl:when>
286 <xsl:otherwise><xsl:value-of select="Hsp_query-from"/></xsl:otherwise>
287 </xsl:choose></xsl:variable>
288 <!-- 5' position on screen -->
289 <xsl:variable name="x1"><xsl:value-of select="($hsp-left div $query-length ) * $svg-width"/></xsl:variable>
290 <!-- 3' position on screen -->
291 <xsl:variable name="x2"><xsl:value-of select="($hsp-right div $query-length ) * $svg-width"/></xsl:variable>
292 <!-- label -->
293 <xsl:variable name="label"><xsl:value-of select="Hsp_hit-from"/> - <xsl:value-of select="Hsp_hit-to"/> (<xsl:choose>
294 <xsl:when test="Hsp_query-from &lt; Hsp_query-to">+</xsl:when>
295 <xsl:otherwise>-</xsl:otherwise></xsl:choose>) e=<xsl:value-of select="Hsp_evalue"/></xsl:variable>
296
297 <!-- translate this Hsp verticaly in its Hit -->
298 <xsl:element name="svg:g">
299 <xsl:attribute name="transform">translate(0,<xsl:value-of select="$preceding-hsp * $hsp-height"/>)</xsl:attribute>
300 <xsl:attribute name="id">hsp-<xsl:value-of select="generate-id(.)"/></xsl:attribute>
301 <xsl:attribute name="title"><xsl:value-of select="Hsp_evalue"/></xsl:attribute>
302
303 <!-- paint the Hsp Rectangle -->
304 <xsl:element name="svg:rect">
305 <xsl:attribute name="x"><xsl:value-of select="$x1"/></xsl:attribute>
306 <xsl:attribute name="y">2</xsl:attribute>
307 <xsl:attribute name="width"><xsl:value-of select="$x2 - $x1"/></xsl:attribute>
308 <xsl:attribute name="height"><xsl:value-of select="$hsp-height - 4"/></xsl:attribute>
309 <!-- choose a color according to the e-value -->
310 <xsl:attribute name="fill"><xsl:choose>
311 <xsl:when test="Hsp_evalue &lt; 1E-100">url(#score1)</xsl:when>
312 <xsl:when test="Hsp_evalue &lt; 1E-10">url(#score2)</xsl:when>
313 <xsl:when test="Hsp_evalue &lt; 0.1">url(#score3)</xsl:when>
314 <xsl:when test="Hsp_evalue &lt; 0">url(#score4)</xsl:when>
315 <xsl:otherwise>url(#score5)</xsl:otherwise>
316 </xsl:choose></xsl:attribute>
317 </xsl:element>
318
319 <!-- paint the label according to the position of the Hsp on screen -->
320 <xsl:choose>
321 <xsl:when test="$x2 &lt; (0.75 * $svg-width)">
322 <xsl:element name="svg:text">
323 <xsl:attribute name="class">t1</xsl:attribute>
324 <xsl:attribute name="x"><xsl:value-of select="$x2 + 10 "/></xsl:attribute>
325 <xsl:attribute name="y"><xsl:value-of select="$hsp-height -1"/></xsl:attribute>
326 <xsl:attribute name="text-anchor">start</xsl:attribute>
327 <xsl:value-of select="$label"/>
328 </xsl:element>
329 </xsl:when>
330 <xsl:when test="$x1 &gt; (0.25 * $svg-width)">
331 <xsl:element name="svg:text">
332 <xsl:attribute name="class">t1</xsl:attribute>
333 <xsl:attribute name="x"><xsl:value-of select="$x1 - 10 "/></xsl:attribute>
334 <xsl:attribute name="y"><xsl:value-of select="$hsp-height -1"/></xsl:attribute>
335 <xsl:attribute name="text-anchor">end</xsl:attribute>
336 <xsl:value-of select="$label"/>
337 </xsl:element>
338 </xsl:when>
339 <xsl:otherwise>
340 <xsl:element name="svg:text">
341 <xsl:attribute name="class">t1</xsl:attribute>
342 <xsl:attribute name="x"><xsl:value-of select="($x2 - $x1) div 2 "/></xsl:attribute>
343 <xsl:attribute name="y"><xsl:value-of select="$hsp-height -1"/></xsl:attribute>
344 <xsl:attribute name="text-anchor">middle</xsl:attribute>
345 <xsl:value-of select="$label"/>
346 </xsl:element>
347 </xsl:otherwise>
348 </xsl:choose>
349
350 </xsl:element>
351 </xsl:template>
352 <!-- ========================================================================= -->
353
354
355 </xsl:stylesheet>

Some random notes:
Line 22-38: I define a few variables such as the number of Hsp or the number of Hits
Line 43: matching the root., we start the XHTML document here
line 73: we start the SVG document here. It is embedded in the XHTML document
line 80-110: CSS can be used for SVG
line 112-137: we define a few gradients to colorize the Hsp. TODO: finding a better method to colorize according to its e-value/score
line 199: for the first iteration, the <Hit> templates are called
line 211: we need to count the number of preceding hits/hsp to know how much we should translate vertically this group of object
line 242: we loop over each hsp in this hit
line 276-292: again, we need to know the number of preceding hits/hsp to translate this hsp vertically. We also calculate the 5' and the 3' position of the Hit in the query.
line 304-307: here, we paint the hsp-rectangle
line 320-348: lousy method, we paint a label for this hsp, trying to find the best place (left/right/middle) to print the label.

The blast file was processed with xsltproc:
xsltproc --novalid blast2svg.xsl blast.xml > ~/blast.xhtml


A sample output is displayed below
blast2svg



Pierre