Java Wrappers for the tables of the UCSC/GoldenPath
Years ago, Jim Kent(UCSC) (the author of the BLAT algorithm) published in the Linux Journal "autoSql and autoXml: Code Generators from the Genome Project" the tools generate database definitions for SQL, write C header files with your data definitions and function prototypes, write C code to get data to and from C structures and generate C code for an XML parser.
For example the following 'as' file (http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/lib/cytoBand.as is the definition of the table called cytoBand:
table cytoBand
"Describes the positions of cytogenetic bands with a chromosome"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position in genoSeq"
uint chromEnd; "End position in genoSeq"
string name; "Name of cytogenetic band"
string gieStain; "Giemsa stain results"
)
will be used to generate the following sql definition
# cytoBand.sql was originally generated by the autoSql program, which also
# generated cytoBand.c and cytoBand.h. This creates the database representation of
# an object which can be loaded and saved from RAM in a fairly
# automatic way.
#Describes the positions of cytogenetic bands with a chromosome
CREATE TABLE cytoBand (
chrom varchar(255) not null, # Human chromosome number
chromStart int unsigned not null, # Start position in genoSeq
chromEnd int unsigned not null, # End position in genoSeq
name varchar(255) not null, # Name of cytogenetic band
gieStain varchar(255) not null, # Giemsa stain results
#Indices
PRIMARY KEY(chrom(12),chromStart),
UNIQUE(chrom(12),chromEnd)
);
the C code, and the C header.
As a java programmer I wanted to create my own wrappers to use the data of the UCSC. I wrote a custom ANT task(the code is available here) using the public mysql server of the UCSC to get the structure of each table (e.g.
desc cytoBand
) and the description of each table (e.g. select autoSqlDef from tableDescriptions where tableName="cytoBand"
). Each structure is parsed and injected into an apache-velocity template (the template is available here).Here is an example of a source generated by the ant task:
As you can see 'enum' and 'set' are transformed into java Enum, getter, tableModel (for gui/swing) are created. Each class also comes with some useful static methods creating the instances from a sql query. For example here is what I wrote today to grab the information about the genes/cytobands/hapmap about a set of snp.
for(RsId rsid: rsSet)
{
PreparedStatement pstmt=con.prepareStatement("select * from snp129 where name=?");
pstmt.setString(1, rsid.getName());
Hg18Snp129 snp= Hg18Snp129.selectOneOrZero(pstmt.executeQuery());
if(snp==null)
{
cout().println(rsid.getName()+TAB+"##Not FOUND");
continue;
}
(... print information about this snp ...)
pstmt=con.prepareStatement("select * from cytoBand where chrom=? and chromStart<=? and chromEnd>=?");
pstmt.setString(1, snp.getChrom());
pstmt.setInt(2, snp.getChromStart());
pstmt.setInt(3, snp.getChromEnd());
for(Hg18CytoBand band:Hg18CytoBand.select(pstmt.executeQuery()))
{
(.. print info about this cytoband...)
}
cout().print(TAB);
pstmt=con.prepareStatement("select * from refGene where chrom=? and txStart<=? and txEnd>=?");
pstmt.setString(1, snp.getChrom());
pstmt.setInt(2, snp.getChromStart());
pstmt.setInt(3, snp.getChromEnd());
i=0;
for(Hg18RefGene gene : Hg18RefGene.select(pstmt.executeQuery()))
{
if(i>0) cout().print(",");
cout().print(gene.getName()+"/"+gene.getName2());
++i;
}
cout().print(TAB);
for(String hapmapDb:HAPMAPDB)
{
pstmt=con.prepareStatement("select * from "+hapmapDb+" where name=?");
pstmt.setString(1, snp.getName());
Hg18HapmapSnps hapmap= Hg18HapmapSnps.selectOneOrZero(pstmt.executeQuery());
if(hapmap==null)
{
(...print empty fields...)
}
else
{
(... print result)
}
}
}
Note: Hibernate is also a popular tool to map objects to databases. But here everything is read-only, (we don't need any transaction) and the relationships between the tables are rather complicated to be described using a mapping file (e.g. see the numerous "Connected Tables and Joining Fields" for the table knownGene).
That's it
Pierre
2 comments:
interesting work - it might be worthwhile to have this robot introspect more about columns like exonStarts - which is really a comma-separated string representation of an array
You right, my Idea is to create a child class that is going to implement all those methods:
class XRefGene extends Hg18RefGene
{
public int[] getExonStartIndexes()
{
public int array[]=new int[getExonCount()];
(... fill this array )
return array;
}
}
Pierre
Post a Comment