31 December 2010

Translating a DNA to a Protein using server-side javascript and C: my notebook

In my previous post , I used Node.js to translate a DNA to a protein on the Server-side, using javascript. In the following post, I again will translate a DNAn but this time by calling a specialized C program on the server side.

Source code


The C program

The C program reads a DNA string from stdin a translate it using the standard genetic code:
Compilation:
gcc -o /my/bin/path/translate translate.c

The Node.js script

When the Node.js server receive a DNA parameter, it spawns a new process to the C program and we write the DNA to this process via 'stdin'.
Each time a new 'data' event (containing the protein) is received, it is printed to the http response. At the end of the process, we close the stream by calling 'end()'.

test

> node-v0.2.5/node translate.js
Server running at http://127.0.0.1:8080

> curl -s "http://localhost:8080/?dna=ATGATGATAGATAGATATAGTAGATATGATCGTCAGCCATACG"
MMIDRYSRYDRQPY


That's it,

Pierre

Server-side javascript: translating a DNA with Node.js

(wikipedia) Node.js is an evented I/O framework for the V8 JavaScript engine on Unix-like platforms. It is intended for writing scalable (javascript-based) network programs such as web servers.

In the following post I will create a javascript server translating a DNA to a protein.

Installing Node.js

I've downloaded the sources for Node.js from http://nodejs.org/#download. It compiled (configure+make) and ran without any problem.

The script

The following script contains a class handling a GeneticCode and the server TranslateDna translating the DNA to a protein, it handles both the POST and the GET http methods. It no parameter is found it displays a simple HTML form, else the form data are decoded and the DNA is translated. The protein is returned as a JSON structure.

Running the server

> node-v0.2.5/node translate.js
Server running at http://127.0.0.1:8080

Test


> curl "http://localhost:8080/"
<html><body><form action="/" method="GET"><h1>DNA</h1><textarea name="dna"></textarea><br/><input type="submit" value="Submit"></form></body></html>

> curl "http://localhost:8080/?dna=ATGAACTATCGATGCTACGACTGATCG"
{"protein":"MNYRCYD*S","query":"ATGAACTATCGATGCTACGACTGATCG"}



That's it,

Pierre

14 December 2010

Looking for an expert ?

Yesterday, Andrew Su asked on Biostar: "Given a gene, what is the best automated method to identify the world experts? ".

Here is my solution:

  • First for a given gene name, we use NCBI-ESearch to find its Gene-Id in NCBI Gene
  • The Gene record is then downloaded as XML using NCBI-EFetch
  • XPATH is used to retrieve all the articles in pubmed about this gene and identified by the XML tags <PubMedId>
  • Each article is downloaded from pubmed. The element <Affiliation> is extracted from the record; sometimes this tag contains the the main contact's email. The authors are also extracted and we count the number of times each author was found. I tried to solve the problem of ambiguity for the names of the authors by looking at the name, surname and initials. If an author's name was contained in the e-mail, it was affected to him
  • At the end, all the authors are sorted in function of the number of times they were seen and the most prolific author is printed out.


Source code


Compilation

javac BioStar4296.java

Test

java BioStar4296 ZC3H7B eif4G1 PRNP

<?xml version="1.0" encoding="UTF-8"?>
<experts>
<gene name="ZC3H7B" geneId="23264" count-pmids="13">
<Person>
<firstName>Sumio</firstName>
<lastName>Sugano</lastName>
<pmid>8125298</pmid>
<pmid>9373149</pmid>
<pmid>14702039</pmid>
<affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
<affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
<affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
</Person>
</gene>
<gene name="eif4G1" geneId="1981" count-pmids="106">
<Person>
<firstName>Nahum</firstName>
<lastName>Sonenberg</lastName>
<pmid>7651417</pmid>
<pmid>7935836</pmid>
<pmid>8449919</pmid>
(...)
<affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
<affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
<affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
(...)
</Person>
</gene>
<gene name="PRNP" geneId="5621" count-pmids="429">
<Person>
<firstName>John</firstName>
<lastName>Collinge</lastName>
<pmid>1352724</pmid>
<pmid>1677164</pmid>
<pmid>2159587</pmid>
<pmid>20583301</pmid>
(...)
<mail>j.collinge@ic.ac.uk</mail>
<affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
<affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk</affilitation>
<affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu</affilitation>
(...)
</Person>
</gene>
</experts>

about this result


  • ZC3H7B the result is wrong. In Dr Sugano's article (3 articles) ZC3H7B was present in among a large set of other genes used in his studies. The expert would be Dr D. Poncet, my former thesis advisor but he 'only' wrote two articles about this protein.
  • Eif4G1: I know Dr Sonenberg is the expert. His email wasn't found.
  • PRNP Collinge seems to be the expert. Dr Collinge's e-mail was detected.


That's it,

Pierre

13 December 2010

A new journal: BMC Open Research Computation #OpenResComp


Citing ''Aims & scope'':Open Research Computation publishes peer reviewed articles that describe the development, capacities, and uses of software designed for use by researchers in any field.

Submissions relating to software for use in any area of research are welcome as are articles dealing with algorithms, useful code snippets, as well as large applications or web services, and libraries.

Open Research Computation differs from other journals with a software focus in its requirement for the software source code to be made available under an Open Source Initiative compliant license, and in its assessment of the quality of documentation and testing of the software.

In addition to articles describing software Open Research Computation also welcomes submissions that review or describe developments relating to software based tools for research. These include, but are not limited to, reviews or proposals for standards, discussion of best practice in research software development, educational and support resources and tools for researchers that develop or use software based tools.


See also the insights from Cameron Neylon, Jan Aerts, Neil 10K Saunders ...

17 November 2010

BLAST/XML+Annotations

I recently asked on Biostar if it would be possible to align two sequences while displaying their respective annotations.

As both answers I received (SPICE and jalview ) require a graphical interface, I quickly wrote a command-line java program doing the job. This program reads a NCBI/BLAST XML output and, if the 'query' or the 'hit' definition lines start with "gi|....", it fetches the Genbank records and the annotations for the sequence and map them onto the alignments.

The program is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BlastAnnotation.java.

Do we need an external library parsing Blast?

No, the java binding compiler, ${JAVA_HOME}/bin/xjc, can generate a java parser from BLAST DTD:
xjc -d src -p sandbox.ncbi.blast -dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/blast/BlastOutput.java
sandbox/ncbi/blast/BlastOutputIterations.java
sandbox/ncbi/blast/BlastOutputMbstat.java
sandbox/ncbi/blast/BlastOutputParam.java
sandbox/ncbi/blast/Hit.java
sandbox/ncbi/blast/HitHsps.java
sandbox/ncbi/blast/Hsp.java
sandbox/ncbi/blast/Iteration.java
sandbox/ncbi/blast/IterationHits.java
sandbox/ncbi/blast/IterationStat.java
sandbox/ncbi/blast/ObjectFactory.java
sandbox/ncbi/blast/Parameters.java
sandbox/ncbi/blast/Statistics.java


And do we need an external library parsing Genbank?

No, again xjc did the job:
xjc -d src -p sandbox.ncbi.gbc -dtd http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/gbc/INSDAltSeqData.java
sandbox/ncbi/gbc/INSDAltSeqDataItems.java
sandbox/ncbi/gbc/INSDAltSeqItem.java
sandbox/ncbi/gbc/INSDAltSeqItemInterval.java
sandbox/ncbi/gbc/INSDAltSeqItemIsgap.java
sandbox/ncbi/gbc/INSDAuthor.java
sandbox/ncbi/gbc/INSDComment.java
sandbox/ncbi/gbc/INSDCommentItem.java
sandbox/ncbi/gbc/INSDCommentParagraph.java
sandbox/ncbi/gbc/INSDCommentParagraphItems.java
sandbox/ncbi/gbc/INSDCommentParagraphs.java
sandbox/ncbi/gbc/INSDFeature.java
sandbox/ncbi/gbc/INSDFeatureIntervals.java
sandbox/ncbi/gbc/INSDFeaturePartial3.java
sandbox/ncbi/gbc/INSDFeaturePartial5.java
sandbox/ncbi/gbc/INSDFeatureQuals.java
sandbox/ncbi/gbc/INSDFeatureSet.java
sandbox/ncbi/gbc/INSDFeatureSetFeatures.java
sandbox/ncbi/gbc/INSDFeatureXrefs.java
sandbox/ncbi/gbc/INSDInterval.java
sandbox/ncbi/gbc/INSDIntervalInterbp.java
sandbox/ncbi/gbc/INSDIntervalIscomp.java
sandbox/ncbi/gbc/INSDKeyword.java
sandbox/ncbi/gbc/INSDQualifier.java
sandbox/ncbi/gbc/INSDReference.java
sandbox/ncbi/gbc/INSDReferenceAuthors.java
sandbox/ncbi/gbc/INSDReferenceXref.java
sandbox/ncbi/gbc/INSDSecondaryAccn.java
sandbox/ncbi/gbc/INSDSeq.java
sandbox/ncbi/gbc/INSDSeqAltSeq.java
sandbox/ncbi/gbc/INSDSeqCommentSet.java
sandbox/ncbi/gbc/INSDSeqFeatureSet.java
sandbox/ncbi/gbc/INSDSeqFeatureTable.java
sandbox/ncbi/gbc/INSDSeqKeywords.java
sandbox/ncbi/gbc/INSDSeqOtherSeqids.java
sandbox/ncbi/gbc/INSDSeqReferences.java
sandbox/ncbi/gbc/INSDSeqSecondaryAccessions.java
sandbox/ncbi/gbc/INSDSeqStrucComments.java
sandbox/ncbi/gbc/INSDSeqid.java
sandbox/ncbi/gbc/INSDSet.java
sandbox/ncbi/gbc/INSDStrucComment.java
sandbox/ncbi/gbc/INSDStrucCommentItem.java
sandbox/ncbi/gbc/INSDStrucCommentItems.java
sandbox/ncbi/gbc/INSDXref.java
sandbox/ncbi/gbc/ObjectFactory.java

Example


As an example I've aligned the "human eif4G1" (gi|303227906) with "Mus musculus eif4G1" (gi|56699433).
The very first lines of the BLAST report are:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.24+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch
<BlastOutput_db>n/a</BlastOutput_db>
<BlastOutput_query-ID>gi|303227906|ref|NM_198241.2|</BlastOutput_query-ID>
<BlastOutput_query-def>Homo sapiens eukaryotic translation initiation factor 4
<BlastOutput_query-len>5538</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>2</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|303227906|ref|NM_198241.2|</Iteration_query-ID>
<Iteration_query-def>Homo sapiens eukaryotic translation initiation factor 4 g
<Iteration_query-len>5538</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|56699433|ref|NM_001005331.1|</Hit_id>
<Hit_def>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif
<Hit_accession>NM_001005331</Hit_accession>
<Hit_len>5460</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>6818.02</Hsp_bit-score>
<Hsp_score>7560</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>53</Hsp_query-from>
<Hsp_query-to>5538</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5418</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>4820</Hsp_identity>
<Hsp_positive>4820</Hsp_positive>
<Hsp_gaps>138</Hsp_gaps>
<Hsp_align-len>5521</Hsp_align-len>
<Hsp_qseq>GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_hseq>GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_midline>||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||...
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And here is the output of my program:

java -jar dist/blastannot.jar ~/jeter.blast.xml

QUERY: Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 2, mRNA
ID:gi|303227906|ref|NM_198241.2| Len:5538
>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif4g1), transcript variant 2, mRNA
NM_001005331
id:gi|56699433|ref|NM_001005331.1| len:5460

e-value:0 gap:138 bitScore:6818.02

#####:############################################ exon 1..180 gene:EIF4G1
QUERY 000000053 GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000102
||||| ||||||||||||||||||||||||||||||||||||||||||||
HIT 000000001 GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000050
#####:############################################ exon 1..128 gene:Eif4g1



################################################## exon 1..180 gene:EIF4G1
QUERY 000000103 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000152
||||||||||||||||||||||||||||||||||||||||||||||||||
HIT 000000051 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000100
################################################## exon 1..128 gene:Eif4g1



############::::::::::###### exon 1..180 gene:EIF4G1
#:::::::::::::###::::: exon 181..237 gene:EIF4G1
QUERY 000000153 CGGCCGCGCGGACTGAAGGAGACTGAAGGCCCTCGGATGCCCAGAACCTG 000000202
|||||||||||| ||||||| |||
HIT 000000101 CGGCCGCGCGGA----------CTGAAGG-------------AGA----- 000000122
############----------#######-------------### gene 1..5460 gene:Eif4g1
############----------#######-------------### exon 1..128 gene:Eif4g1



::::::::::::::::::::::##:##:::::::# exon 181..237 gene:EIF4G1
############### exon 238..331 gene:EIF4G1
QUERY 000000203 TAGGCCGCACCGTGGACTTGTTCTTAATCGAGGGGGTGCTGGGGGGACCC 000000252
|| || ||||||||||||||||
HIT 000000123 ----------------------CTGAA-------GGTGCTGGGGGGACCC 000000143
----------------------##:##-------# exon 1..128 gene:Eif4g1
############### exon 129..222 gene:Eif4g1



#:###############################:###:############ exon 238..331 gene:EIF4G1
##############:###:############ CDS 272..5071 gene:EIF4G1
QUERY 000000253 TGATGTGGCACCAAATGAAATGAACAAAGCTCCACAGTCCACAGGCCCCC 000000302
| ||||||||||||||||||||||||||||||| ||| ||||||||||||
HIT 000000144 TAATGTGGCACCAAATGAAATGAACAAAGCTCCCCAGCCCACAGGCCCCC 000000193
#:###############################:###:############ exon 129..222 gene:Eif4g1
##############:###:############ CDS 163..4944 gene:Eif4g1


(...)


############:#:#:#####:######:#:########:##:###### exon 4890..5521 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 4948..5505 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 5174..5403 gene:EIF4G1
QUERY 000005319 TTGGTGTGTCTTGGGGTGGGGAGGGGCACCAACGCCTGCCCCTGGGGTCC 000005368
|||||||||||| | | ||||| |||||| | |||||||| || ||||||
HIT 000005201 TTGGTGTGTCTTTGCGGGGGGAAGGGCACTACCGCCTGCCTCTAGGGTCC 000005250
############:#:#:#####:######:#:########:##:###### exon 4760..5396 gene:Eif4g1



::##############:##########:###################### exon 4890..5521 gene:EIF4G1
::##############:##########:###################### STS 4948..5505 gene:EIF4G1
::##############:##########:####### STS 5174..5403 gene:EIF4G1
QUERY 000005369 TTTTTTTTATTTTCTGAAAATCACTCTCGGGACTGCCGTCCTCGCTGCTG 000005418
|||||||||||||| |||||||||| ||||||||||||||||||||||
HIT 000005251 --TTTTTTATTTTCTG-AAATCACTCTTGGGACTGCCGTCCTCGCTGCTG 000005297
--##############-##########:###################### exon 4760..5396 gene:Eif4g1



######################:#############:############# exon 4890..5521 gene:EIF4G1
######################:#############:############# STS 4948..5505 gene:EIF4G1
QUERY 000005419 GGGGCATATGCCCCAGCCCCTGTACCACCCCTGCTGTTGCCTGGGCAGGG 000005468
|||||||||||||||||||||| ||||||||||||| |||||||||||||
HIT 000005298 GGGGCATATGCCCCAGCCCCTGCACCACCCCTGCTGCTGCCTGGGCAGGG 000005347
######################:#############:############# exon 4760..5396 gene:Eif4g1



#:##-############################################: exon 4890..5521 gene:EIF4G1
#:##-################################# STS 4948..5505 gene:EIF4G1
###### polyA_signal 5496..5501 gene:EIF4G1
# polyA_site 5516 gene:EIF4G1
QUERY 000005469 GGAA-GGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAG 000005517
| || ||||||||||||||||||||||||||||||||||||||||||||
HIT 000005348 GAAAGGGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAA 000005397
#:##:############################################ exon 4760..5396 gene:Eif4g1



:::# exon 4890..5521 gene:EIF4G1
# polyA_site 5521 gene:EIF4G1
QUERY 000005518 CTCAAAAAAAAAAAAAAAAAA 000005538
||||||||||||||||||
HIT 000005398 AAAAAAAAAAAAAAAAAAAAA 000005418



That's it,
Pierre

29 October 2010

Revisions=f(time): and the winner is Charles Darwin

I've been playing with the mediawiki API to draw a timeline of popular scientists through the last centuries using the number of revisions in wikipedia as the Y-axis. The java code generating this timeline is available here:WikipediaBioEdits.java. This code:

  • loops over the articles having a Template:infobox scientist
  • gets the categories of an article and retrieves the birth/death dates threw the categories 'Births_by_year' and 'Deaths_by_year'
  • gets one picture in an article an resizes it to 64px
  • counts the number of revisions for an article


At the end, the generated timeline looks like this: (best viewed with firefox):
1192196227323502427250425812658273528122152515791633168717411795184919031957

That's it

Pierre