03 October 2008

Java Wrappers for the tables of the UCSC/GoldenPath

Years ago, Jim Kent(UCSC) (the author of the BLAT algorithm) published in the Linux Journal "autoSql and autoXml: Code Generators from the Genome Project" the tools generate database definitions for SQL, write C header files with your data definitions and function prototypes, write C code to get data to and from C structures and generate C code for an XML parser.

For example the following 'as' file (http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/lib/cytoBand.as is the definition of the table called cytoBand:

table cytoBand
"Describes the positions of cytogenetic bands with a chromosome"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position in genoSeq"
uint chromEnd; "End position in genoSeq"
string name; "Name of cytogenetic band"
string gieStain; "Giemsa stain results"
)


will be used to generate the following sql definition
# cytoBand.sql was originally generated by the autoSql program, which also 
# generated cytoBand.c and cytoBand.h. This creates the database representation of
# an object which can be loaded and saved from RAM in a fairly
# automatic way.

#Describes the positions of cytogenetic bands with a chromosome
CREATE TABLE cytoBand (
chrom varchar(255) not null, # Human chromosome number
chromStart int unsigned not null, # Start position in genoSeq
chromEnd int unsigned not null, # End position in genoSeq
name varchar(255) not null, # Name of cytogenetic band
gieStain varchar(255) not null, # Giemsa stain results
#Indices
PRIMARY KEY(chrom(12),chromStart),
UNIQUE(chrom(12),chromEnd)
);

the C code, and the C header.

As a java programmer I wanted to create my own wrappers to use the data of the UCSC. I wrote a custom ANT task(the code is available here) using the public mysql server of the UCSC to get the structure of each table (e.g. desc cytoBand) and the description of each table (e.g. select autoSqlDef from tableDescriptions where tableName="cytoBand"). Each structure is parsed and injected into an apache-velocity template (the template is available here).

Here is an example of a source generated by the ant task:

As you can see 'enum' and 'set' are transformed into java Enum, getter, tableModel (for gui/swing) are created. Each class also comes with some useful static methods creating the instances from a sql query. For example here is what I wrote today to grab the information about the genes/cytobands/hapmap about a set of snp.

 for(RsId rsid: rsSet)
{
PreparedStatement pstmt=con.prepareStatement("select * from snp129 where name=?");
pstmt.setString(1, rsid.getName());
Hg18Snp129 snp= Hg18Snp129.selectOneOrZero(pstmt.executeQuery());
if(snp==null)
{
cout().println(rsid.getName()+TAB+"##Not FOUND");
continue;
}

(... print information about this snp ...)

pstmt=con.prepareStatement("select * from cytoBand where chrom=? and chromStart<=? and chromEnd>=?");
pstmt.setString(1, snp.getChrom());
pstmt.setInt(2, snp.getChromStart());
pstmt.setInt(3, snp.getChromEnd());

for(Hg18CytoBand band:Hg18CytoBand.select(pstmt.executeQuery()))
{
(.. print info about this cytoband...)
}
cout().print(TAB);
pstmt=con.prepareStatement("select * from refGene where chrom=? and txStart<=? and txEnd>=?");
pstmt.setString(1, snp.getChrom());
pstmt.setInt(2, snp.getChromStart());
pstmt.setInt(3, snp.getChromEnd());
i=0;
for(Hg18RefGene gene : Hg18RefGene.select(pstmt.executeQuery()))
{
if(i>0) cout().print(",");
cout().print(gene.getName()+"/"+gene.getName2());
++i;
}
cout().print(TAB);

for(String hapmapDb:HAPMAPDB)
{
pstmt=con.prepareStatement("select * from "+hapmapDb+" where name=?");
pstmt.setString(1, snp.getName());
Hg18HapmapSnps hapmap= Hg18HapmapSnps.selectOneOrZero(pstmt.executeQuery());
if(hapmap==null)
{
(...print empty fields...)
}
else
{
(... print result)
}
}
}


Note: Hibernate is also a popular tool to map objects to databases. But here everything is read-only, (we don't need any transaction) and the relationships between the tables are rather complicated to be described using a mapping file (e.g. see the numerous "Connected Tables and Joining Fields" for the table knownGene).

That's it

Pierre

2 comments:

Jermdemo said...

interesting work - it might be worthwhile to have this robot introspect more about columns like exonStarts - which is really a comma-separated string representation of an array

Pierre Lindenbaum said...

You right, my Idea is to create a child class that is going to implement all those methods:

class XRefGene extends Hg18RefGene
{
public int[] getExonStartIndexes()
{
public int array[]=new int[getExonCount()];
(... fill this array )
return array;
}
}

Pierre