29 March 2011

Mapping a mutation on a protein to the genome.

A colleague asked me to solve the following problem: starting from an article describing a mutation in a protein (don't dream, there was no accession number), she wanted to know the position of that mutation on the human genome, to determine whether a known SNP was there.
The program I wrote, backlocate, is available on github: https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BackLocate.java and uses the public MySQL server of the UCSC.
  • The input is the name of a gene and a mutation "{AA-wild}{position}{AA-mut}"
  • A first SQL query searches for the gene symbol in the table kgXref.
  • A second SQL query searches the table knownGene for all the transcripts matching this kgXref entry.
  • The genomic DNA for a transcript is downloaded from the DAS-DNA server of the UCSC
  • The protein, the mRNA and the genomic sequences are reconstituted to find the 3 possible bases of the mutated codon.
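The last step can be sketched as follows. This is not the code of BackLocate itself, only a minimal illustration under several assumptions: the transcript is on the plus strand, exon and CDS coordinates are 0-based and half-open as in the UCSC knownGene table, and the class name, method names and toy coordinates are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the actual BackLocate implementation.
class BackLocateSketch {
    // Mutation written as {AA-wild}{position}{AA-mut}, e.g. "D240Y".
    static final Pattern MUTATION = Pattern.compile("([A-Z*])(\\d+)([A-Z*])");

    /** Extracts the 1-based peptide position from a mutation such as "D240Y". */
    static int parsePosition(String mutation) {
        Matcher m = MUTATION.matcher(mutation.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Bad mutation: " + mutation);
        }
        return Integer.parseInt(m.group(2));
    }

    /** 0-based indexes, in the coding sequence, of the three bases of a codon. */
    static int[] codonIndexes(int peptidePos1) {
        int first = (peptidePos1 - 1) * 3;
        return new int[] { first, first + 1, first + 2 };
    }

    /**
     * Converts a 0-based index in the coding sequence to a 0-based genomic index.
     * Assumes a plus-strand transcript with exons sorted by genomic position.
     * Returns -1 if the index lies beyond the end of the coding sequence.
     */
    static int cdsToGenomic(int cdsIndex, int[] exonStarts, int[] exonEnds,
                            int cdsStart, int cdsEnd) {
        int remaining = cdsIndex;
        for (int i = 0; i < exonStarts.length; i++) {
            // Keep only the coding part of this exon.
            int start = Math.max(exonStarts[i], cdsStart);
            int end = Math.min(exonEnds[i], cdsEnd);
            if (start >= end) continue; // exon is entirely UTR
            int codingLength = end - start;
            if (remaining < codingLength) return start + remaining;
            remaining -= codingLength;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Toy transcript (made-up coordinates): two exons, CDS from 105 to 208.
        int[] exonStarts = { 100, 200 };
        int[] exonEnds = { 110, 210 };
        // The second codon of this toy transcript spans the two exons.
        for (int cdsIndex : codonIndexes(2)) {
            System.out.println(cdsIndex + "\t"
                + cdsToGenomic(cdsIndex, exonStarts, exonEnds, 105, 208));
        }
    }
}
```

For a peptide position of 240 the codon indexes are 717, 718 and 719, which is what the column index0.in.rna shows in the example. A minus-strand transcript would need the exons walked in reverse order and the bases complemented, which this sketch leaves out.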

Example


Let's find the genomic position for EIF4G1 at position 240 in the protein (note: the mutated codon spans two exons on the transcript "uc010hxy.2"):
echo -e "EIF4G1\tD240Y" | java -jar backlocate.jar

Result:
#User.Gene AA1 petide.pos.1 AA2 knownGene.name knownGene.strand knownGene.AA index0.in.rna codon base.in.rna chromosome index0.in.genomic exon
##uc003fnt.2
EIF4G1 D 240 Y uc003fnt.2 + D 717 GAC G chr3 184040214 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 718 GAC A chr3 184040215 Exon 7
EIF4G1 D 240 Y uc003fnt.2 + D 719 GAC C chr3 184040216 Exon 7
##uc010hxy.2
EIF4G1 D 240 Y uc010hxy.2 + D 717 GAT G chr3 184038780 Exon 9
EIF4G1 D 240 Y uc010hxy.2 + D 718 GAT A chr3 184039069 Exon 10
EIF4G1 D 240 Y uc010hxy.2 + D 719 GAT T chr3 184039070 Exon 10
##uc003fnw.2
EIF4G1 D 240 Y uc003fnw.2 + D 717 GAT G chr3 184038780 Exon 8
EIF4G1 D 240 Y uc003fnw.2 + D 718 GAT A chr3 184039069 Exon 9
EIF4G1 D 240 Y uc003fnw.2 + D 719 GAT T chr3 184039070 Exon 9
##Warning ref aminod acid for uc003fnp.2 [240] is not the same (I/D)
EIF4G1 D 240 Y uc003fnp.2 + I 717 ATC A chr3 184039089 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 718 ATC T chr3 184039090 Exon 10
EIF4G1 D 240 Y uc003fnp.2 + I 719 ATC C chr3 184039091 Exon 10
(...)


That's it,

Pierre

1 comment:

  1. The UCSC program pslMap should do this. It requires a PSL alignment of the known genes to the genome, and can then map gene coordinates to the genome or vice versa.
