The program I wrote, backlocate, is available on GitHub: https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BackLocate.java. It uses the public MySQL server of the UCSC.
- The input is the name of a gene and a mutation written as "{AA-wild}{position}{AA-mut}" (for example D240Y).
- A first SQL query searches for the gene symbol in the table kgXref.
- A second SQL query searches the table knownGene for all the transcripts matching this kgXref entry.
- The genomic DNA for each transcript is downloaded from the DAS-DNA server of the UCSC.
- The protein, the mRNA and the genomic sequences are reconstituted to find the 3 possible bases of the mutated codon.
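The last step above can be sketched in a few lines of Java. This is not the code of BackLocate itself, only a minimal illustration under stated assumptions: the class and method names are hypothetical, the exon coordinates in main are invented (they are not the real EIF4G1 exons), and the exon and CDS bounds are taken as 0-based, half-open intervals, which is how the knownGene table stores them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the BackLocate source.
class CodonLocatorSketch {
    /** Parsed "{AA-wild}{position}{AA-mut}" string, e.g. D240Y. */
    static final class Mutation {
        final char wild;
        final int position; // 1-based position in the protein
        final char mutant;
        Mutation(char wild, int position, char mutant) {
            this.wild = wild;
            this.position = position;
            this.mutant = mutant;
        }
    }

    static Mutation parseMutation(String s) {
        String t = s.trim();
        if (t.length() < 3) throw new IllegalArgumentException("Bad mutation: " + s);
        int pos = Integer.parseInt(t.substring(1, t.length() - 1));
        return new Mutation(t.charAt(0), pos, t.charAt(t.length() - 1));
    }

    /** 0-based genomic position of every coding base, in 5'->3' transcript order. */
    static List<Integer> codingPositions(int[] exonStarts, int[] exonEnds,
                                         int cdsStart, int cdsEnd, char strand) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < exonStarts.length; i++) {
            // Keep only the part of the exon that lies inside the CDS.
            int from = Math.max(exonStarts[i], cdsStart);
            int to = Math.min(exonEnds[i], cdsEnd);
            for (int g = from; g < to; g++) positions.add(g);
        }
        // On the minus strand the transcript runs right to left on the genome.
        if (strand == '-') Collections.reverse(positions);
        return positions;
    }

    /** The three genomic positions of the codon for a 1-based peptide position. */
    static int[] codonPositions(List<Integer> coding, int peptidePos1) {
        int first = (peptidePos1 - 1) * 3; // 0-based index of the codon's first base
        if (peptidePos1 < 1 || first + 2 >= coding.size()) {
            throw new IllegalArgumentException("Position outside the CDS: " + peptidePos1);
        }
        return new int[] {coding.get(first), coding.get(first + 1), coding.get(first + 2)};
    }

    public static void main(String[] args) {
        Mutation m = parseMutation("D240Y");
        System.out.println(m.wild + " " + m.position + " " + m.mutant);
        // Invented two-exon transcript: the third codon straddles the intron.
        List<Integer> coding = codingPositions(
            new int[] {100, 200}, new int[] {110, 230}, 102, 225, '+');
        int[] codon = codonPositions(coding, 3);
        System.out.println(codon[0] + " " + codon[1] + " " + codon[2]);
    }
}
```

Listing every coding base is wasteful but keeps the arithmetic obvious. For a transcript on the minus strand, the bases read from the genomic sequence would also have to be complemented, which this sketch does not do.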
Example
Let's find the genomic position for EIF4G1 at position 240 in the protein (note: on the transcript "uc010hxy.2", the codon for this mutation spans two exons):
echo -e "EIF4G1\tD240Y" | java -jar backlocate.jar
Result:
#User.Gene AA1 petide.pos.1 AA2 knownGene.name knownGene.strand knownGene.AA index0.in.rna codon base.in.rna chromosome index0.in.genomic exon
##uc003fnt.2
EIF4G1 D 240 Y uc003fnt.2 + D 717 GAC G chr3 184040214 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 718 GAC A chr3 184040215 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 719 GAC C chr3 184040216 Exon 7
##uc010hxy.2
EIF4G1 D 240 Y uc010hxy.2 + D 717 GAT G chr3 184038780 Exon 9
EIF4G1 D 240 Y uc010hxy.2 + D 718 GAT A chr3 184039069 Exon 10
EIF4G1 D 240 Y uc010hxy.2 + D 719 GAT T chr3 184039070 Exon 10
##uc003fnw.2
EIF4G1 D 240 Y uc003fnw.2 + D 717 GAT G chr3 184038780 Exon 8
EIF4G1 D 240 Y uc003fnw.2 + D 718 GAT A chr3 184039069 Exon 9
EIF4G1 D 240 Y uc003fnw.2 + D 719 GAT T chr3 184039070 Exon 9
##Warning ref aminod acid for uc003fnp.2 [240] is not the same (I/D)
EIF4G1 D 240 Y uc003fnp.2 + I 717 ATC A chr3 184039089 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 718 ATC T chr3 184039090 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 719 ATC C chr3 184039091 Exon 10
(...)
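A short note on reading this output. The column index0.in.rna starts at 717 because (240 - 1) * 3 = 717, and for uc010hxy.2 the three genomic positions are not consecutive (184038780, then 184039069), which is how a codon split between two exons shows up. The program reports all three bases of the codon; working out which base change gives the mutant amino acid is left to the reader. The Java sketch below illustrates that follow-up step. It is not part of BackLocate: the class and method names are hypothetical, and it assumes the standard genetic code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the BackLocate source.
class MutatedCodonSketch {
    // Standard genetic code, codons enumerated with bases in the order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    static char translate(String codon) {
        if (codon.length() != 3) throw new IllegalArgumentException("Not a codon: " + codon);
        int i1 = BASES.indexOf(codon.charAt(0));
        int i2 = BASES.indexOf(codon.charAt(1));
        int i3 = BASES.indexOf(codon.charAt(2));
        if (i1 < 0 || i2 < 0 || i3 < 0) throw new IllegalArgumentException("Bad base in " + codon);
        return AMINO_ACIDS.charAt(16 * i1 + 4 * i2 + i3);
    }

    /** Single-base substitutions giving targetAA, each reported as "offset:codon". */
    static List<String> singleBaseChanges(String codon, char targetAA) {
        List<String> hits = new ArrayList<>();
        for (int offset = 0; offset < 3; offset++) {
            for (int b = 0; b < BASES.length(); b++) {
                char base = BASES.charAt(b);
                if (base == codon.charAt(offset)) continue; // not a change
                String mutated = codon.substring(0, offset) + base + codon.substring(offset + 1);
                if (translate(mutated) == targetAA) hits.add(offset + ":" + mutated);
            }
        }
        return hits;
    }

    /** True when the codon's genomic positions are not adjacent, i.e. an intron lies between them. */
    static boolean spansIntron(int[] genomicPositions) {
        for (int i = 1; i < genomicPositions.length; i++) {
            if (Math.abs(genomicPositions[i] - genomicPositions[i - 1]) != 1) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Codons and positions copied from the output above.
        System.out.println(singleBaseChanges("GAC", 'Y')); // uc003fnt.2
        System.out.println(singleBaseChanges("GAT", 'Y')); // uc010hxy.2
        System.out.println(spansIntron(new int[] {184038780, 184039069, 184039070}));
        System.out.println(spansIntron(new int[] {184040214, 184040215, 184040216}));
    }
}
```

For both GAC and GAT the only single-base change that yields Y is at offset 0 (G to T), so D240Y corresponds to the first genomic position listed for each of those transcripts.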
That's it,
Pierre
The UCSC program pslMap should do this: it requires a PSL alignment of the known genes to the genome, and can then map gene coordinates to the genome or vice versa.