## 28 January 2010

### Elementary school for Bioinformatics: Suffix Arrays

I was recently asked to map the flanking ends of a large set of SNPs on the human genome. As a perfect match algorithm was sufficient I've implemented a suffix array algorithm to map each flanking sequence. Via Wikipedia:A suffix array is an array of integers giving the starting positions of suffixes of a string in lexicographical order. It can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search.

Say, the sequence of your genome is `"gggtaaagctataactattgatcaggcgtt"`.
The array of integers giving the starting positions of the suffixes is:

`[00]    gggtaaagctataactattgatcaggcgtt[01]    ggtaaagctataactattgatcaggcgtt[02]    gtaaagctataactattgatcaggcgtt[03]    taaagctataactattgatcaggcgtt[04]    aaagctataactattgatcaggcgtt[05]    aagctataactattgatcaggcgtt[06]    agctataactattgatcaggcgtt[07]    gctataactattgatcaggcgtt[08]    ctataactattgatcaggcgtt[09]    tataactattgatcaggcgtt[10]    ataactattgatcaggcgtt[11]    taactattgatcaggcgtt[12]    aactattgatcaggcgtt[13]    actattgatcaggcgtt[14]    ctattgatcaggcgtt[15]    tattgatcaggcgtt[16]    attgatcaggcgtt[17]    ttgatcaggcgtt[18]    tgatcaggcgtt[19]    gatcaggcgtt[20]    atcaggcgtt[21]    tcaggcgtt[22]    caggcgtt[23]    aggcgtt[24]    ggcgtt[25]    gcgtt[26]    cgtt[27]    gtt[28]    tt[29]    t`
This array is sorted using the lexicographical order of the prefixes:
`[04]    aaagctataactattgatcaggcgtt[12]    aactattgatcaggcgtt[05]    aagctataactattgatcaggcgtt[13]    actattgatcaggcgtt[06]    agctataactattgatcaggcgtt[23]    aggcgtt[10]    ataactattgatcaggcgtt[20]    atcaggcgtt[16]    attgatcaggcgtt[22]    caggcgtt[26]    cgtt[08]    ctataactattgatcaggcgtt[14]    ctattgatcaggcgtt[19]    gatcaggcgtt[25]    gcgtt[07]    gctataactattgatcaggcgtt[24]    ggcgtt[00]    gggtaaagctataactattgatcaggcgtt[01]    ggtaaagctataactattgatcaggcgtt[02]    gtaaagctataactattgatcaggcgtt[27]    gtt[29]    t[03]    taaagctataactattgatcaggcgtt[11]    taactattgatcaggcgtt[09]    tataactattgatcaggcgtt[15]    tattgatcaggcgtt[21]    tcaggcgtt[18]    tgatcaggcgtt[28]    tt[17]    ttgatcaggcgtt`
Now, using a Binary Search we can quickly find where is the sequence `"aagctataacta"` in the genome:
`[04]    aaagctataactattgatcaggcgtt[12]    aactattgatcaggcgtt[05]    aagctataactattgatcaggcgttHIT!    `aagctataacta`[13]    actattgatcaggcgtt[06]    agctataactattgatcaggcgtt(...)`
I've implemented this basic algorithm in Java for my needs (finding the pairs of hits, etc...). The source code is available here:. The program was able to map 17E6 * 2 sequences on the human genome in about half a day and, as each sequence can be searched regardless of the others, I now plan to make the code multithreaded.

## Implementation (Samples)

### Building the index

//write genomic sequence in this dynamic byte array
ByteArrayOutputStream baos=new ByteArrayOutputStream(150000000);
//open the genomiic sequence
FileInputStream fin=new FileInputStream(chromFile);

//ignore the fasta header
int c=fin.read();
if(c!='>') throw new IOException("not starting with '>' "+chromFile);
///skip fasta header
while((c=fin.read())!=-1)
{
if(c=='\n') break;
}
//read the fasta sequence
byte array[]=new byte[1000000];
int nRead;
while((nRead=fin.read(array))!=-1)
{
int j=0;
//remove blanks
for(int i=0;i< nRead;++i)
{
if( Character.isWhitespace(array[i])||
Character.isDigit(array[i])) continue;
array[j]=(byte)Character.toUpperCase(array[i]);
if(array[j]=='>') throw new IOException("expected only one sequence in "+chromFile);
j++;
}
baos.write(array,0,j);
}
fin.close();
//get the genomic sequence as an array of bytes
this.chrom_sequence=baos.toByteArray();
baos=null;
LOG.info("OK in byte array. length= "+chrom_sequence.length);

//build the array of pointers
this.pointers=new int[chrom_sequence.length];
for(int i=0;i+this.flanking<= this.pointers.length;++i)
{
this.pointers[i]=i;
}

(...)
//sort the suffix array
quicksort(0,this.pointers.length-1);
(...)

### Binary Search

private int lower_bound(
int first, int last,
byte shortSeq[]
)
{
int len = last - first;
while (len > 0)
{
int half = len / 2;
int middle = first + half;
if (compare(middle,shortSeq)<0)
{
first = middle + 1;
len -= half + 1;
}
else
{
len = half;
}
}
return first;
}
(...)
byte shortSeq[]=...;
int i=lower_bound(0,this.pointers.length,shortSeq);
if(i>= this.pointers.length ) return;
while(i< this.pointers.length &&
compare(i, shortSeq)==0)
{
System.out.println(
shortName+"\t"+
chromName+"\t"+
strand+"\t"+
this.pointers[i]+"\t"+
(this.pointers[i]+shortSeq.length)
);
i++;
}

That's it
Pierre

## 26 January 2010

### QuoteEditor

It's 00H25 here, and while one of my programs is aligning some SNPs on the human genome, I'm going to describe the tiny tool I wrote this Week-end. QuoteEditor is a java editor used to store some famous quotes in a RDF/XML file. The java 'jar' and the documentation are available at :

.

## Usage

java -jar lindenb/build/quotes.jar [-f <rdf-file> ]
the optional parameter '-f' is the name of the XML/RDF file where the quotes will be added. If this file does not exists, it will be created by QuoteEditor (default is ~/quotes.rdf )

## The Interface

• Author: must be a URL or a name that must be resolved in wikipedia. For example 'Victor Hugo' will be resolved as http://en.wikipedia.org/wiki/Victor_Hugo
• Source: a free text. The source of the quote
• Date: a free text: A date for this quote
• Keywords: one or more keywords ( or URL ) , space delimited. Each keyword must be resolved as a Category in wikipedia. For example Scientists will be resolved as http://en.wikipedia.org/wiki/Category:Scientists
• Quote Source: a free text. Where this quote was found
• lang: the language for this quote

each time a new quote is saved it will be added at the end of the rdf-file.

## Result

Here is an example of RDF/XML file generated by QuoteEditor:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:q="urn:ontology:quotes:">
<q:Quote>
<q:quote xml:lang="fr">L'œil était dans la tombe et regardait Caïn.</q:quote>
<q:author rdf:resource="http://en.wikipedia.org/wiki/Victor Hugo"/>
<q:source>La Conscience, from La Légende des siècles (1859),</q:source>
<q:date>1859</q:date>
<q:origin>http://en.wikiquote.org/wiki/Victor_Hugo</q:origin>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:Murders"/>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Remorse"/>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:Emotions"/>
</q:Quote>
<q:Quote>
<q:quote xml:lang="fr">Des mouches pour des enfants espiègles, voilà ce que nous sommes pour les Dieux ;
Ils nous tuent pour se divertir.</q:quote>
<q:author rdf:resource="http://en.wikipedia.org/wiki/William Shakespeare"/>
<q:source>Le Roi Lear (1605), William Shakespeare</q:source>
<q:date>1605</q:date>
<q:origin>http://fr.wikiquote.org/wiki/William_Shakespeare</q:origin>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:God"/>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:Crimes"/>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:Flies"/>
<q:subject rdf:resource="http://en.wikipedia.org/wiki/Category:Recreation"/>
</q:Quote>
(...)
</rdf:RDF>

That's it
Pierre

PS: Hum, my program mapping the SNPs is still running... 200Mo generated so far...

## 10 January 2010

### What is the CSS style of that HTML element ?: CSSPopup, an extension for firefox

Trying to find the CSS style of a HTML element is a common task for me and I often look in the <style/> of the pages to try to find what can be "this inspiring CSS". So, I've created CSSPopup, a small extension for firefox. This extension appends a new button in the contextual menu that will print all the CSS selectors of the element that was clicked. For example when I clicked on "Welcome to NCBI" at http://www.ncbi.nlm.nih.gov/, the result was:

h1 {
font-size:32px;
font-weight:bold;
line-height:36px;
margin-bottom:21.4333px;
margin-top:21.4333px;
padding-left:16px;
-moz-column-gap:32px;
}

div {
}

div {
}

div {
}

div {
margin-bottom:0px;
margin-left:0px;
margin-right:0px;
margin-top:0px;
}

body {
margin-bottom:8px;
margin-left:8px;
margin-right:8px;
margin-top:8px;
}

html {
background-attachment:scroll;
background-color:transparent;
background-image:none;
background-position:0% 0%;
background-repeat:repeat;
border-collapse:separate;
border-spacing:0px 0px;
bottom:auto;
caption-side:top;
clear:none;
clip:auto;
color:rgb(0, 0, 0);
content:none;
counter-increment:none;
counter-reset:none;
cursor:auto;
direction:ltr;
display:block;
empty-cells:show;
float:none;
font-family:serif;
font-size:16px;
font-size-adjust:none;
font-style:normal;
font-variant:normal;
font-weight:400;
height:auto;
ime-mode:auto;
left:auto;
letter-spacing:normal;
line-height:19px;
list-style-image:none;
list-style-position:outside;
list-style-type:disc;
margin-bottom:0px;
margin-left:0px;
margin-right:0px;
margin-top:0px;
marker-offset:auto;
max-height:none;
max-width:none;
min-height:0px;
min-width:0px;
opacity:1;
outline-color:rgb(0, 0, 0);
outline-offset:0px;
outline-style:none;
outline-width:0px;
overflow:visible;
overflow-x:visible;
overflow-y:visible;
padding-bottom:0px;
padding-left:0px;
padding-right:0px;
padding-top:0px;
page-break-after:auto;
page-break-before:auto;
pointer-events:visiblepainted;
position:static;
quotes:"“" "”" "‘" "’";
right:auto;
table-layout:auto;
text-align:start;
text-decoration:none;
text-indent:0px;
text-rendering:auto;
text-shadow:none;
text-transform:none;
top:auto;
unicode-bidi:normal;
vertical-align:baseline;
visibility:visible;
white-space:normal;
width:auto;
word-spacing:normal;
word-wrap:normal;
z-index:auto;
-moz-appearance:none;
-moz-background-clip:border;
-moz-background-inline-policy:continuous;
-moz-background-origin:padding;
-moz-binding:none;
-moz-border-bottom-colors:none;
-moz-border-left-colors:none;
-moz-border-right-colors:none;
-moz-border-top-colors:none;
-moz-border-image:none;
-moz-border-radius-bottomleft:0px;
-moz-border-radius-bottomright:0px;
-moz-border-radius-topleft:0px;
-moz-border-radius-topright:0px;
-moz-box-align:stretch;
-moz-box-direction:normal;
-moz-box-flex:0;
-moz-box-ordinal-group:1;
-moz-box-orient:horizontal;
-moz-box-pack:start;
-moz-box-shadow:none;
-moz-box-sizing:content-box;
-moz-column-count:auto;
-moz-column-gap:16px;
-moz-column-width:auto;
-moz-column-rule-width:0px;
-moz-column-rule-style:none;
-moz-column-rule-color:rgb(0, 0, 0);
-moz-float-edge:content-box;
-moz-force-broken-image-icon:0;
-moz-image-region:auto;
-moz-outline-color:rgb(0, 0, 0);
-moz-outline-offset:0px;
-moz-outline-radius-bottomleft:0px;
-moz-outline-radius-bottomright:0px;
-moz-outline-radius-topleft:0px;
-moz-outline-radius-topright:0px;
-moz-outline-style:none;
-moz-outline-width:0px;
-moz-stack-sizing:stretch-to-fit;
-moz-transform:none;
-moz-transform-origin:50% 50%;
-moz-user-focus:none;
-moz-user-input:auto;
-moz-user-modify:read-only;
-moz-user-select:auto;
-moz-appearance:none;
-moz-user-select:auto;
}
The extension can be downloaded at http://code.google.com/p/lindenb/downloads/list and the source code is available at http://code.google.com/p/lindenb/source/browse/trunk/proj/tinyxul/csspopup/.

That's it,
Pierre

## 06 January 2010

### Transforming Pubmed to Simile/Exhibit with XSLT

Cameron Neylon recently asked on friendfeed:

"Advice request: What is the best approach to exposing a publications list online. Any good tools for generating html/xml/rdfa?".

In the comments, it was suggested to use Simile/Exhibit. This gave me the idea to write a XSLT stylesheet to transform a pubmed xml result to an Exhibit file. This XSLT stylesheet is available at:

## Usage

Save your pubmed result as XML in a file named pubmed_result.xml and invoke your favorite xslt processor:
xsltproc pubmed2exhibit.xsl ~/pubmed_result.xml > file.html
.
That's all ! You now have an interactive bibliography in a html file !
Note: the JSON data are embedded in the html file thanks to this hack.

## Result

I've tested the stylesheet with the following query. The Exhibit displays and filters (by year/journal/author) the articles:

as well as a timeline:

That's it !

Pierre

## 05 January 2010

### My Python notebook: displaying the state of the Genome Projects

This post if about my first Python program. I'm still a newbie so please, don't flame ! :-)

The NCBI hosts a XML file describing the state of the Genome Projects at ftp://ftp.ncbi.nih.gov/genomes/genomeprj/gp.xml. The very first lines of the file look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gp:DocumentSet SYSTEM "gp4v.dtd">
<gp:DocumentSet xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://s
chemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:gp="gp">
<gp:Document>
<gp:ProjectID>1</gp:ProjectID>
<gp:IsPrivate>false</gp:IsPrivate>
<gp:eType>eGenome</gp:eType>
<gp:eMethod>eWGS</gp:eMethod>
<gp:eTechnologies>eUnspecified</gp:eTechnologies>
<gp:OriginDB>NCBI</gp:OriginDB>
<gp:CreateDate>03/05/2003 15:03:16</gp:CreateDate>
<gp:ModificationDate>07/18/2006 16:09:02</gp:ModificationDate>
<gp:ProjectName/>
<gp:OrganismName>Acidobacterium capsulatum ATCC 51196</gp:OrganismName>
<gp:StrainName>ATCC 51196</gp:StrainName>
<gp:TaxID>240015</gp:TaxID>
<gp:LocusTagPrefix/>
<gp:DNASource/>
<gp:SequencingDepth/>
<gp:ProjectURL>http://www.tigr.org/tdb/mdb/mdbinprogress.html</gp:ProjectURL>
<gp:DataURL/>
<gp:OrganismDescription/>
<gp:EstimatedGenomeSize>4.15</gp:EstimatedGenomeSize>
(...)
<gp:ChromosomeDefinitions/>
<gp:SubmittedSequences/>
<gp:LegacyPrefixes/>
</gp:Document>
<gp:Document>
<gp:ProjectID>3</gp:ProjectID>
<gp:IsPrivate>false</gp:IsPrivate>
<gp:eType>eGenome</gp:eType>
(...)
The program I wrote reads this XML file and generates the following visualization:
The left part is the tree of the NCBI taxonomy. In front of each taxon, the circles on the right are the Genome Projects. The size of a circle is the log10 of the estimated size of the genome (if available). The position on the x-axis of those circles is the date given by the tag "<p:CreateDate>".
There are two main classes: A Taxon is a node in the NCBI taxonomy, it contains a link to his children and to his parent:
class Taxon:
def __init__(self):
self.id=-1
self.name=None
self.parent=None
self.children=set()
. A Project is an item in the XML file. It contains a link to a Taxon:
class Project:
def __init__(self):
self.id=-1
self.taxon=None
self.genomeSize=None
self.date=None
The XML file is processed with a SAX handler. Each time a new project is found, the corresponding taxon is downloaded :
class GPHandler(handler.ContentHandler):
LIMIT=1E7
SLEEP=0
def __init__(self,owner):
self.owner=owner
self.content=None
self.project=None

(...)

def startElementNS(self,name,qname,attrs):
if name[1]=="Document":
self.project=Project()
elif name[1] in ["ProjectID","TaxID","CreateDate","EstimatedGenomeSize"] :
self.content=""

def endElementNS(self,name,qname):
if name[1]=="Document":
if len(self.owner.id2project)<GPHandler.LIMIT:
self.owner.id2project[self.project.id]= self.project
self.project=None
elif name[1]=="TaxID":
if len(self.owner.id2project)<GPHandler.LIMIT:
self.project.taxon=self.owner.findTaxonById(int(self.content))
time.sleep(GPHandler.SLEEP) #don't be evil with NCBI
elif name[1]=="ProjectID":
self.project.id=int(self.content)
sys.stderr.write("Project id:"+self.content+" n="+str(len(self.owner.id2project))+"\n")
elif name[1]=="EstimatedGenomeSize" and len(self.content)>0:
self.project.genomeSize=float(self.content)
elif name[1]=="CreateDate":
self.project.date=self.sql2date(self.content)
self.content=None
def characters(self,content):
if self.content!=None:
self.content+=content
The XML description of each taxons is downloaded and parsed using the NCBI EFetch service. A XML for a taxon looks like this:
<TaxaSet>
<Taxon>
<TaxId>240015</TaxId>
<ScientificName>Acidobacterium capsulatum ATCC 51196</ScientificName>
<OtherNames>
<EquivalentName>Acidobacterium capsulatum strain ATCC 51196</EquivalentName>
<EquivalentName>Acidobacterium capsulatum str. ATCC 51196</EquivalentName>
</OtherNames>
<ParentTaxId>33075</ParentTaxId>
<Rank>no rank</Rank>
<Division>Bacteria</Division>
<GeneticCode>
<GCId>11</GCId>
<GCName>Bacterial, Archaeal and Plant Plastid</GCName>
</GeneticCode>
<MitoGeneticCode>
<MGCId>0</MGCId>
<MGCName>Unspecified</MGCName>
</MitoGeneticCode>
<Lineage>cellular organisms; Bacteria; Fibrobacteres/Acidobacteria group; Acidobacteria; Acidobacteria (class); Acidobacteriales; Acidobacteriaceae; Acidobacterium; Acidobacterium capsulatum</Lineage>
<LineageEx>
<Taxon>
<TaxId>131567</TaxId>
<ScientificName>cellular organisms</ScientificName>
<Rank>no rank</Rank>
</Taxon>
<Taxon>
<TaxId>2</TaxId>
<ScientificName>Bacteria</ScientificName>
<Rank>superkingdom</Rank>
</Taxon>
<Taxon>
<TaxId>131550</TaxId>
<ScientificName>Fibrobacteres/Acidobacteria group</ScientificName>
<Rank>superphylum</Rank>
</Taxon>
<Taxon>
<TaxId>57723</TaxId>
<ScientificName>Acidobacteria</ScientificName>
<Rank>phylum</Rank>
</Taxon>
<Taxon>
<TaxId>204432</TaxId>
<ScientificName>Acidobacteria (class)</ScientificName>
<Rank>class</Rank>
</Taxon>
<Taxon>
<TaxId>204433</TaxId>
<ScientificName>Acidobacteriales</ScientificName>
<Rank>order</Rank>
</Taxon>
<Taxon>
<TaxId>204434</TaxId>
<ScientificName>Acidobacteriaceae</ScientificName>
<Rank>family</Rank>
</Taxon>
<Taxon>
<TaxId>33973</TaxId>
<ScientificName>Acidobacterium</ScientificName>
<Rank>genus</Rank>
</Taxon>
<Taxon>
<TaxId>33075</TaxId>
<ScientificName>Acidobacterium capsulatum</ScientificName>
<Rank>species</Rank>
</Taxon>
</LineageEx>
<CreateDate>2003/07/25</CreateDate>
<UpdateDate>2005/01/19</UpdateDate>
<PubDate>2005/01/29</PubDate>
</Taxon>
</TaxaSet>
This time I used a DOM implementation for python to analyse the file:
def findTaxonById(self,taxId):
if taxId in self.id2taxon:
return self.id2taxon[taxId]

url="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&retmode=xml&id="+str(taxId)
new_taxon=Taxon()
new_taxon.id=taxId
new_taxon.parent

dom = xml.dom.minidom.parse(urllib.urlopen(url))
top= dom.documentElement
e1= self.first(top,"Taxon")
new_taxon.name = self.textContent(self.first(e1,"ScientificName"))

lineage=[]
lineage.append(self.taxonRoot)
for e2 in self.elements(self.first(e1,"LineageEx"),"Taxon"):
t2=Taxon()
t2.id= int(self.textContent(self.first(e2,"TaxId")))
t2.name= self.textContent(self.first(e2,"ScientificName"))
if t2.id in self.id2taxon:
t2= self.id2taxon[t2.id]
else:
self.id2taxon[t2.id]=t2
lineage.append(t2)
lineage.append(new_taxon)

i=1
while i < len(lineage):
lineage[i-1].children.add(lineage[i])
lineage[i].parent=lineage[i-1]
i+=1

self.id2taxon[new_taxon.id]=new_taxon
return new_taxon
At the end, the figure is generated using SVG:

## Source code

#Author Pierre Lindenbaum
#Mail: plindenbaum@yahoo.fr
#WWW: http://plindenbaum.blogspot.com
#usage:
# wget -O gp.xml "ftp://ftp.ncbi.nih.gov/genomes/genomeprj/gp.xml"
# python gp.py gp.xml
from xml.sax import make_parser, handler,saxutils
import sys,time,math,pickle,os.path
import xml.dom.minidom
import urllib

class SVG:
NS="http://www.w3.org/2000/svg"
class XLINK:
NS="http://www.w3.org/1999/xlink"

class Main:
DB_FILE="gp.db"
def __init__(self):
self.id2project=dict()
self.id2taxon=dict()
self.min_size=1.0E13
self.max_size=0.0
self.min_date=1.0E13
self.max_date=0.0
self.taxonRoot=Taxon()
self.taxonRoot.id=0
self.taxonRoot.name="Tree of Life"
self.id2taxon[self.taxonRoot.id]=self.taxonRoot

def escape(self,s):
x=""
for c in s:
if c=='<':
x+="&lt;"
elif c=='>':
x+="&gt;"
elif c=='&':
x+="&amp;"
elif c=='\'':
x+="&apos;"
elif c=='\'':
x+="&quot;"
else:
x+=c
return x

def first(self,root,name):
for c1 in root.childNodes:
if c1.nodeType!= xml.dom.minidom.Node.ELEMENT_NODE or \
c1.nodeName!=name:
continue
return c1
return None
def elements(self,root,name):
a=[]
for c1 in root.childNodes:
if c1.nodeType!= xml.dom.minidom.Node.ELEMENT_NODE or \
c1.nodeName!=name:
continue
a.append(c1)
return a

def textContent(self,root):
content=""
for c1 in root.childNodes:
if c1.nodeType== xml.dom.minidom.Node.TEXT_NODE:
content+= c1.nodeValue
else:
content+=self.textContent(c1)
return content

def findTaxonById(self,taxId):
if taxId in self.id2taxon:
return self.id2taxon[taxId]

url="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&retmode=xml&id="+str(taxId)
sys.stderr.write(url+"\n")

new_taxon=Taxon()
new_taxon.id=taxId
new_taxon.parent

dom = xml.dom.minidom.parse(urllib.urlopen(url))
top= dom.documentElement
e1= self.first(top,"Taxon")
new_taxon.name = self.textContent(self.first(e1,"ScientificName"))

lineage=[]
lineage.append(self.taxonRoot)
for e2 in self.elements(self.first(e1,"LineageEx"),"Taxon"):
t2=Taxon()
t2.id= int(self.textContent(self.first(e2,"TaxId")))
t2.name= self.textContent(self.first(e2,"ScientificName"))
if t2.id in self.id2taxon:
t2= self.id2taxon[t2.id]
else:
self.id2taxon[t2.id]=t2
lineage.append(t2)
lineage.append(new_taxon)

i=1
while i < len(lineage):
lineage[i-1].children.add(lineage[i])
lineage[i].parent=lineage[i-1]
i+=1

self.id2taxon[new_taxon.id]=new_taxon
return new_taxon

def svg(self):
width= 1000
height=10000
for p in self.id2project.values():
if p.genomeSize!=None:
self.min_size = min(self.min_size, p.genomeSize)
self.max_size = max(self.max_size, p.genomeSize)
self.min_date = min(self.min_date, p.date)
self.max_date = max(self.max_date, p.date)
shift= (self.max_date-self.min_date)*0.05
self.min_date-=shift
self.max_date+=shift
sys.stderr.write(" size:"+str(self.min_size)+"/"+str(self.max_size))
sys.stderr.write(" date:"+str(self.min_date)+"/"+str(self.max_date))
print "<?xml version='1.0' encoding='UTF-8'?>"
print "<svg xmlns='"+SVG.NS+"' xmlns:xlink='"+XLINK.NS+"' "+\
"width='"+str(width+1)+"' "+\
"height='"+str(height+1)+"' "+\
"style='stroke:black;fill:none;stroke-width:0.5px;' "+\
"><title>Genome Projects</title>"
print "<rect x='0' y='0' width='"+str(width/2.0)+"' height='"+str(height)+"' style='stroke:green;'/>"
print "<rect x='"+str(1+width/2.0)+"' y='0' width='"+str(width/2.0)+"' height='"+str(height)+"' style='stroke:blue;'/>"
self.recurs(self.taxonRoot,0,0,width/2,height,(width/2.0)/self.taxonRoot.depth())
print "</svg>"
def recurs(self,taxon,x,y,width,height,h_size):
total=0.0
midy=y+height/2.0
print "<circle cx='"+str(x)+"' cy='"+str(midy)+"' r='"+str(2)+"' title='"+self.escape(taxon.name)+"' style='fill:black;'/>"

for p in self.id2project.values():
if p.taxon!=taxon:
continue
cx= width+ ((p.date - self.min_date)/(self.max_date - self.min_date))*width
cy= midy
print "<a target='ncbi' title='"+self.escape(taxon.name)+" projectId:"+str(p.id)+" size:"+str(p.genomeSize)+"' xlink:href='http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&amp;cmd=ShowDetailView&amp;TermToSearch="+str(p.id)+"'>"
print "<g style='stroke:red;' >"
if p.genomeSize!=None:
radius=max(2.0,((math.log10(p.genomeSize) - math.log10(self.min_size))/(math.log10(self.max_size) - math.log10(self.min_size)))*100.0)
print "<circle cx='"+str(cx)+"' cy='"+str(cy)+"' r='"+str(radius)+"' style='fill:orange;fill-opacity:0.1;'/>"
print "<line x1='"+str(cx-5)+"' y1='"+str(cy)+"' x2='"+str(cx+5)+"' y2='"+str(cy)+"'/>"
print "<line x1='"+str(cx)+"' y1='"+str(cy-5)+"' x2='"+str(cx)+"' y2='"+str(cy+5)+"'/>"
print "</g></a>"

for c in taxon.children:
total+= c.weight()
for c in taxon.children:
h2=height* (c.weight()/total)
y2=y+h2/2.0
w2=h_size
if len(c.children)==0:
w2=width-x
print "<polyline points='"+\
str(x)+","+str(midy)+" "+\
str(x+w2/2.0)+","+str(midy)+" "+\
str(x+w2/2.0)+","+str(y2)+" "+\
str(x+w2)+","+str(y2)+" "+\
"' title='"+self.escape(c.name)+"'/>"
self.recurs(c,x+w2,y,width,h2,w2)
y+=h2

class Taxon:
def __init__(self):
self.id=-1
self.name=None
self.parent=None
self.children=set()
def __hash__(self):
return self.id.__hash__()
def __eq__(self,other):
if other==None:
return False
return self.id==other.id
def weight(self):
w=1
for t in self.children:
w+= t.weight()
return w
def depth(self):
##sys.stderr.write(self.name+"\n")
d=0.0
for t in self.children:
d= max(t.depth(),d)
return 1.0+d

class Project:
def __init__(self):
self.id=-1
self.taxon=None
self.genomeSize=None
self.date=None

class GPHandler(handler.ContentHandler):
LIMIT=1E7
SLEEP=0
def __init__(self,owner):
self.owner=owner
self.content=None
self.project=None
def sql2date(self,s):
s=s[:s.index(' ')]
cal =time.mktime( time.strptime(s,"%m/%d/%Y") )
return cal
def startElementNS(self,name,qname,attrs):
if name[1]=="Document":
self.project=Project()
elif name[1] in ["ProjectID","TaxID","CreateDate","EstimatedGenomeSize"] :
self.content=""
def endElementNS(self,name,qname):
if name[1]=="Document":
if len(self.owner.id2project)<GPHandler.LIMIT:
self.owner.id2project[self.project.id]= self.project
self.project=None
elif name[1]=="TaxID":
if len(self.owner.id2project)<GPHandler.LIMIT:
self.project.taxon=self.owner.findTaxonById(int(self.content))
time.sleep(GPHandler.SLEEP) #don't be evil with NCBI
elif name[1]=="ProjectID":
self.project.id=int(self.content)
sys.stderr.write("Project id:"+self.content+" n="+str(len(self.owner.id2project))+"\n")
elif name[1]=="EstimatedGenomeSize" and len(self.content)>0:
self.project.genomeSize=float(self.content)
elif name[1]=="CreateDate":
self.project.date=self.sql2date(self.content)
self.content=None
def characters(self,content):
if self.content!=None:
self.content+=content

main=Main();
sys.stderr.write(main.escape("<>!")+"\n")
parser = make_parser()
# we want XML namespaces
parser.setFeature(handler.feature_namespaces,True)
# tell parser to disable validation
parser.setFeature(handler.feature_validation,False)
parser.setFeature(handler.feature_external_ges, False)
parser.setContentHandler(GPHandler(main))
parser.parse(sys.argv[1])

main.svg()

That's it,

Pierre