03 April 2009

Consequences : SNP, cDNA, proteins, etc....

This post is about Consequences, a tool that finds the consequences of a set of mutations mapped on the human genome. It was motivated by a recent post on FriendFeed, where Daniel MacArthur asked: “Given a list of human b36 coordinates for a list of genic SNPs (most not in dbSNP), what would be the quickest way to get a list of the genes they're found in and, if possible, the amino acid position they would affect?”.

About a year ago, I wrote a tool named "Consequences" that answered this question, but the sources are somewhere in a tar.gz, burned on an old CD, in a cardboard box, in my cellar... so it was faster to rewrite this simple code from scratch. The result should be fine, but please tell me if you find a bug.

This tool takes as input a tab-delimited file containing the following fields:
  1. a name for your SNP
  2. the chromosome, e.g. 'chr2' (at this time only one chromosome per input is supported)
  3. the position on the chromosome (0-based: the first base has index 0)
  4. the base observed ON THE PLUS STRAND OF THE GENOME

The sequence of the chromosome is then downloaded from the DAS server of the UCSC, and the genes are fetched from the 'knownGene' table on the mysql server of the UCSC. Then I simply look at the consequence of each mutation. Here is a sample of the output:

<consequences chrom="chr1">
<observed-mutation position="1116" name="snp1" base="A">
<gene name="uc001aaa.2" exon-count="3" strand="+" txStart="1115" txEnd="4121" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
<gene name="uc009vip.1" exon-count="2" strand="+" txStart="1115" txEnd="4272" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
</observed-mutation>
(...)
</observed-mutation>
<observed-mutation position="1149167" name="snp282" base="A">
<gene name="uc009vjv.1" exon-count="6" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142754" cdsEnd="1149171">
<in-exon name="Exon 2" codon-wild="CAG" codon-mut="TAG" aa-wild="Q" aa-mut="*" base-wild="C" base-mut="T" index-cdna="3" index-protein="1">
<wild-cDNA>ATG C AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</wild-cDNA>
<mut-cDNA>ATG T AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</mut-cDNA>
<wild-protein>M Q RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</wild-protein>
<mut-protein>M * RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</mut-protein>
</in-exon>
</gene>
<gene name="uc009vjw.1" exon-count="7" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142150" cdsEnd="1142150">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
<observed-mutation position="1205906" name="snp195" base="A">
<gene name="uc001adt.1" exon-count="18" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1205904" cdsEnd="1216853">
<in-exon name="Exon 1" codon-wild="ATG" codon-mut="ATA" aa-wild="M" aa-mut="I" base-wild="G" base-mut="A" index-cdna="2" index-protein="0">
<wild-cDNA>AT G AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGG
CCAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</wild-cDNA>
<mut-cDNA>AT A AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGGC
CAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</mut-cDNA>
<wild-protein> M RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</wild-protein>
<mut-protein> I RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</mut-protein>
</in-exon>
</gene>
<gene name="uc001adu.1" exon-count="17" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1209267" cdsEnd="1216853">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
</consequences>
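
To make the attributes of the <in-exon> elements above easier to read, here is a minimal sketch of how such a consequence could be computed. It is not the code of the Consequences tool: the class and method names are hypothetical, and it assumes that the mutation lies in the same exon as the start codon (so no intron has to be skipped) and that cdsStart/cdsEnd follow the UCSC convention (0-based start, exclusive end). On the minus strand, the base observed on the plus strand is complemented before being placed in the cDNA.

```java
// Hypothetical sketch, not the source of the Consequences tool.
class ConsequenceSketch {
    // Standard genetic code, with codon bases ordered T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Translates one upper-case DNA codon; returns '?' for anything unexpected. */
    static char translate(String codon) {
        if (codon.length() != 3) return '?';
        int i = BASES.indexOf(codon.charAt(0));
        int j = BASES.indexOf(codon.charAt(1));
        int k = BASES.indexOf(codon.charAt(2));
        if (i < 0 || j < 0 || k < 0) return '?';
        return AMINO_ACIDS.charAt(16 * i + 4 * j + k);
    }

    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'G': return 'C';
            case 'C': return 'G';
            default: return 'N';
        }
    }

    /**
     * 0-based index in the cDNA of a 0-based genomic position.
     * ASSUMPTION: the position is in the same exon as the start codon.
     */
    static int cdnaIndex(char strand, int cdsStart, int cdsEnd, int position) {
        return strand == '+' ? position - cdsStart : (cdsEnd - 1) - position;
    }

    /** Summarizes the codon and amino acid change caused by one observed base. */
    static String describe(String wildCdna, char strand, int cdsStart, int cdsEnd,
                           int position, char plusStrandBase) {
        int index = cdnaIndex(strand, cdsStart, cdsEnd, position);
        char mutBase = strand == '+' ? plusStrandBase : complement(plusStrandBase);
        int codonStart = index - index % 3;
        String wildCodon = wildCdna.substring(codonStart, codonStart + 3);
        StringBuilder mutCodon = new StringBuilder(wildCodon);
        mutCodon.setCharAt(index % 3, mutBase);
        return wildCodon + ">" + mutCodon
            + " " + translate(wildCodon) + ">" + translate(mutCodon.toString())
            + " index-cdna=" + index + " index-protein=" + (index / 3);
    }

    public static void main(String[] args) {
        // snp282 on uc009vjv.1 (minus strand); only the first bases of the cDNA are needed
        System.out.println(describe("ATGCAGCGC", '-', 1142754, 1149171, 1149167, 'A'));
        // snp195 on uc001adt.1 (plus strand)
        System.out.println(describe("ATGAGGGCA", '+', 1205904, 1216853, 1205906, 'A'));
    }
}
```

With the coordinates of snp282 and snp195 taken from the sample, this sketch gives the same codons, amino acids and indexes as the XML above (CAG to TAG, Q to *, index-cdna 3; ATG to ATA, M to I, index-cdna 2). A position located in a later exon would also need the intron lengths to be subtracted.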


The source code is available here:

A 'jar' is available for download at http://lindenb.googlecode.com/files/consequences.jar.
Running the tool:
java -cp {path}/mysql-connector-java-xxxx-bin.jar:consequences.jar org.lindenb.tinytools.Consequences your-list-of-snp.txt
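
The file given on the command line must follow the four-column layout described at the top of this post. As an illustration only, here is a small sketch of how one line of that file could be parsed and checked. The class SnpInputSketch and its Snp holder are hypothetical names, not part of consequences.jar, and the validation rules (a single A, C, G or T, a non-negative position) are my assumptions.

```java
// Hypothetical sketch of the input format, not part of consequences.jar.
class SnpInputSketch {
    /** One line of the input: name, chromosome, 0-based position, plus-strand base. */
    static final class Snp {
        final String name;
        final String chrom;
        final int position;
        final char base;

        Snp(String name, String chrom, int position, char base) {
            this.name = name;
            this.chrom = chrom;
            this.position = position;
            this.base = base;
        }
    }

    /** Parses one tab-delimited line; throws IllegalArgumentException if it is malformed. */
    static Snp parseLine(String line) {
        String[] tokens = line.split("\t");
        if (tokens.length < 4) {
            throw new IllegalArgumentException("expected 4 tab-delimited fields: " + line);
        }
        // NumberFormatException is a subclass of IllegalArgumentException
        int position = Integer.parseInt(tokens[2].trim());
        if (position < 0) {
            throw new IllegalArgumentException("position must be >= 0: " + line);
        }
        String base = tokens[3].trim().toUpperCase();
        if (base.length() != 1 || "ACGT".indexOf(base.charAt(0)) < 0) {
            throw new IllegalArgumentException("base must be one of A, C, G, T: " + line);
        }
        return new Snp(tokens[0].trim(), tokens[1].trim(), position, base.charAt(0));
    }

    public static void main(String[] args) {
        // The line that would describe snp282 from the sample output
        Snp snp = parseLine("snp282\tchr1\t1149167\tA");
        System.out.println(snp.name + " " + snp.chrom + ":" + snp.position + " " + snp.base);
    }
}
```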


Well, that is not big science but it might be helpful.
That's it.

Pierre
