17 November 2010

BLAST/XML+Annotations

I recently asked on Biostar if it would be possible to align two sequences while displaying their respective annotations.

As both answers I received (SPICE and jalview ) require a graphical interface, I quickly wrote a command-line java program doing the job. This program reads a NCBI/BLAST XML output and, if the 'query' or the 'hit' definition lines start with "gi|....", it fetches the Genbank records and the annotations for the sequence and map them onto the alignments.

The program is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BlastAnnotation.java.

Do we need an external library parsing Blast?

No, the java binding compiler, ${JAVA_HOME}/bin/xjc, can generate a java parser from BLAST DTD:
xjc -d src -p sandbox.ncbi.blast -dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/blast/BlastOutput.java
sandbox/ncbi/blast/BlastOutputIterations.java
sandbox/ncbi/blast/BlastOutputMbstat.java
sandbox/ncbi/blast/BlastOutputParam.java
sandbox/ncbi/blast/Hit.java
sandbox/ncbi/blast/HitHsps.java
sandbox/ncbi/blast/Hsp.java
sandbox/ncbi/blast/Iteration.java
sandbox/ncbi/blast/IterationHits.java
sandbox/ncbi/blast/IterationStat.java
sandbox/ncbi/blast/ObjectFactory.java
sandbox/ncbi/blast/Parameters.java
sandbox/ncbi/blast/Statistics.java


And do we need an external library parsing Genbank?

No, again xjc did the job:
xjc -d src -p sandbox.ncbi.gbc -dtd http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/gbc/INSDAltSeqData.java
sandbox/ncbi/gbc/INSDAltSeqDataItems.java
sandbox/ncbi/gbc/INSDAltSeqItem.java
sandbox/ncbi/gbc/INSDAltSeqItemInterval.java
sandbox/ncbi/gbc/INSDAltSeqItemIsgap.java
sandbox/ncbi/gbc/INSDAuthor.java
sandbox/ncbi/gbc/INSDComment.java
sandbox/ncbi/gbc/INSDCommentItem.java
sandbox/ncbi/gbc/INSDCommentParagraph.java
sandbox/ncbi/gbc/INSDCommentParagraphItems.java
sandbox/ncbi/gbc/INSDCommentParagraphs.java
sandbox/ncbi/gbc/INSDFeature.java
sandbox/ncbi/gbc/INSDFeatureIntervals.java
sandbox/ncbi/gbc/INSDFeaturePartial3.java
sandbox/ncbi/gbc/INSDFeaturePartial5.java
sandbox/ncbi/gbc/INSDFeatureQuals.java
sandbox/ncbi/gbc/INSDFeatureSet.java
sandbox/ncbi/gbc/INSDFeatureSetFeatures.java
sandbox/ncbi/gbc/INSDFeatureXrefs.java
sandbox/ncbi/gbc/INSDInterval.java
sandbox/ncbi/gbc/INSDIntervalInterbp.java
sandbox/ncbi/gbc/INSDIntervalIscomp.java
sandbox/ncbi/gbc/INSDKeyword.java
sandbox/ncbi/gbc/INSDQualifier.java
sandbox/ncbi/gbc/INSDReference.java
sandbox/ncbi/gbc/INSDReferenceAuthors.java
sandbox/ncbi/gbc/INSDReferenceXref.java
sandbox/ncbi/gbc/INSDSecondaryAccn.java
sandbox/ncbi/gbc/INSDSeq.java
sandbox/ncbi/gbc/INSDSeqAltSeq.java
sandbox/ncbi/gbc/INSDSeqCommentSet.java
sandbox/ncbi/gbc/INSDSeqFeatureSet.java
sandbox/ncbi/gbc/INSDSeqFeatureTable.java
sandbox/ncbi/gbc/INSDSeqKeywords.java
sandbox/ncbi/gbc/INSDSeqOtherSeqids.java
sandbox/ncbi/gbc/INSDSeqReferences.java
sandbox/ncbi/gbc/INSDSeqSecondaryAccessions.java
sandbox/ncbi/gbc/INSDSeqStrucComments.java
sandbox/ncbi/gbc/INSDSeqid.java
sandbox/ncbi/gbc/INSDSet.java
sandbox/ncbi/gbc/INSDStrucComment.java
sandbox/ncbi/gbc/INSDStrucCommentItem.java
sandbox/ncbi/gbc/INSDStrucCommentItems.java
sandbox/ncbi/gbc/INSDXref.java
sandbox/ncbi/gbc/ObjectFactory.java

Example


As an example I've aligned the "human eif4G1" (gi|303227906) with "Mus musculus eif4G1" (gi|56699433).
The very first lines of the BLAST report are:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.24+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch
<BlastOutput_db>n/a</BlastOutput_db>
<BlastOutput_query-ID>gi|303227906|ref|NM_198241.2|</BlastOutput_query-ID>
<BlastOutput_query-def>Homo sapiens eukaryotic translation initiation factor 4
<BlastOutput_query-len>5538</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>2</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|303227906|ref|NM_198241.2|</Iteration_query-ID>
<Iteration_query-def>Homo sapiens eukaryotic translation initiation factor 4 g
<Iteration_query-len>5538</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|56699433|ref|NM_001005331.1|</Hit_id>
<Hit_def>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif
<Hit_accession>NM_001005331</Hit_accession>
<Hit_len>5460</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>6818.02</Hsp_bit-score>
<Hsp_score>7560</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>53</Hsp_query-from>
<Hsp_query-to>5538</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5418</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>4820</Hsp_identity>
<Hsp_positive>4820</Hsp_positive>
<Hsp_gaps>138</Hsp_gaps>
<Hsp_align-len>5521</Hsp_align-len>
<Hsp_qseq>GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_hseq>GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_midline>||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||...
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And here is the output of my program:

java -jar dist/blastannot.jar ~/jeter.blast.xml

QUERY: Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 2, mRNA
ID:gi|303227906|ref|NM_198241.2| Len:5538
>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif4g1), transcript variant 2, mRNA
NM_001005331
id:gi|56699433|ref|NM_001005331.1| len:5460

e-value:0 gap:138 bitScore:6818.02

#####:############################################ exon 1..180 gene:EIF4G1
QUERY 000000053 GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000102
||||| ||||||||||||||||||||||||||||||||||||||||||||
HIT 000000001 GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000050
#####:############################################ exon 1..128 gene:Eif4g1



################################################## exon 1..180 gene:EIF4G1
QUERY 000000103 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000152
||||||||||||||||||||||||||||||||||||||||||||||||||
HIT 000000051 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000100
################################################## exon 1..128 gene:Eif4g1



############::::::::::###### exon 1..180 gene:EIF4G1
#:::::::::::::###::::: exon 181..237 gene:EIF4G1
QUERY 000000153 CGGCCGCGCGGACTGAAGGAGACTGAAGGCCCTCGGATGCCCAGAACCTG 000000202
|||||||||||| ||||||| |||
HIT 000000101 CGGCCGCGCGGA----------CTGAAGG-------------AGA----- 000000122
############----------#######-------------### gene 1..5460 gene:Eif4g1
############----------#######-------------### exon 1..128 gene:Eif4g1



::::::::::::::::::::::##:##:::::::# exon 181..237 gene:EIF4G1
############### exon 238..331 gene:EIF4G1
QUERY 000000203 TAGGCCGCACCGTGGACTTGTTCTTAATCGAGGGGGTGCTGGGGGGACCC 000000252
|| || ||||||||||||||||
HIT 000000123 ----------------------CTGAA-------GGTGCTGGGGGGACCC 000000143
----------------------##:##-------# exon 1..128 gene:Eif4g1
############### exon 129..222 gene:Eif4g1



#:###############################:###:############ exon 238..331 gene:EIF4G1
##############:###:############ CDS 272..5071 gene:EIF4G1
QUERY 000000253 TGATGTGGCACCAAATGAAATGAACAAAGCTCCACAGTCCACAGGCCCCC 000000302
| ||||||||||||||||||||||||||||||| ||| ||||||||||||
HIT 000000144 TAATGTGGCACCAAATGAAATGAACAAAGCTCCCCAGCCCACAGGCCCCC 000000193
#:###############################:###:############ exon 129..222 gene:Eif4g1
##############:###:############ CDS 163..4944 gene:Eif4g1


(...)


############:#:#:#####:######:#:########:##:###### exon 4890..5521 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 4948..5505 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 5174..5403 gene:EIF4G1
QUERY 000005319 TTGGTGTGTCTTGGGGTGGGGAGGGGCACCAACGCCTGCCCCTGGGGTCC 000005368
|||||||||||| | | ||||| |||||| | |||||||| || ||||||
HIT 000005201 TTGGTGTGTCTTTGCGGGGGGAAGGGCACTACCGCCTGCCTCTAGGGTCC 000005250
############:#:#:#####:######:#:########:##:###### exon 4760..5396 gene:Eif4g1



::##############:##########:###################### exon 4890..5521 gene:EIF4G1
::##############:##########:###################### STS 4948..5505 gene:EIF4G1
::##############:##########:####### STS 5174..5403 gene:EIF4G1
QUERY 000005369 TTTTTTTTATTTTCTGAAAATCACTCTCGGGACTGCCGTCCTCGCTGCTG 000005418
|||||||||||||| |||||||||| ||||||||||||||||||||||
HIT 000005251 --TTTTTTATTTTCTG-AAATCACTCTTGGGACTGCCGTCCTCGCTGCTG 000005297
--##############-##########:###################### exon 4760..5396 gene:Eif4g1



######################:#############:############# exon 4890..5521 gene:EIF4G1
######################:#############:############# STS 4948..5505 gene:EIF4G1
QUERY 000005419 GGGGCATATGCCCCAGCCCCTGTACCACCCCTGCTGTTGCCTGGGCAGGG 000005468
|||||||||||||||||||||| ||||||||||||| |||||||||||||
HIT 000005298 GGGGCATATGCCCCAGCCCCTGCACCACCCCTGCTGCTGCCTGGGCAGGG 000005347
######################:#############:############# exon 4760..5396 gene:Eif4g1



#:##-############################################: exon 4890..5521 gene:EIF4G1
#:##-################################# STS 4948..5505 gene:EIF4G1
###### polyA_signal 5496..5501 gene:EIF4G1
# polyA_site 5516 gene:EIF4G1
QUERY 000005469 GGAA-GGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAG 000005517
| || ||||||||||||||||||||||||||||||||||||||||||||
HIT 000005348 GAAAGGGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAA 000005397
#:##:############################################ exon 4760..5396 gene:Eif4g1



:::# exon 4890..5521 gene:EIF4G1
# polyA_site 5521 gene:EIF4G1
QUERY 000005518 CTCAAAAAAAAAAAAAAAAAA 000005538
||||||||||||||||||
HIT 000005398 AAAAAAAAAAAAAAAAAAAAA 000005418



That's it,
Pierre

2 comments:

yannickwurm said...

Bravo!

what if you're working with a non-model non-genbank organism? (such as ants :))

Say you have only a gff...

cheers,
yannick

rashid1891 said...

this is very good man