Yesterday, Andrew Su asked on Biostar: "Given a gene, what is the best automated method to identify the world experts?"
Here is my solution:
- First, for a given gene name, NCBI ESearch is used to find its Gene ID in NCBI Gene.
- The Gene record is then downloaded as XML using NCBI EFetch.
- XPath is used to retrieve all the PubMed articles about this gene; they are identified by the XML tag <PubMedId>.
- Each article is downloaded from PubMed. The element <Affiliation> is extracted from the record; sometimes this tag contains the main contact's e-mail. The authors are also extracted, and we count the number of times each author was found. I tried to resolve the ambiguity of author names by comparing the first name, last name and initials. If an author's name was contained in the e-mail, that e-mail was assigned to that author.
- At the end, all the authors are sorted by the number of times they were seen, and the most prolific author is printed.
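As an illustration of the XPath step, here is a minimal sketch that pulls the distinct <PubMedId> values out of a Gene XML document. It is not the BioStar4296 source: the class name PmidExtractorSketch and the method extractPmids are hypothetical, and it assumes the XML has already been fetched and is held in a String.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original program.
final class PmidExtractorSketch {

    /** Returns the distinct PubMed ids found in the XML, in document order. */
    static Set<String> extractPmids(String geneXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(geneXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every <PubMedId> element, wherever it sits in the record.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//PubMedId", dom, XPathConstants.NODESET);
            // A LinkedHashSet drops duplicates but keeps the first-seen order.
            Set<String> pmids = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                pmids.add(nodes.item(i).getTextContent().trim());
            }
            return pmids;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse Gene XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny hand-made document; a real Gene record is far larger.
        String xml = "<Entrezgene>"
                + "<PubMedId>8125298</PubMedId>"
                + "<PubMedId>9373149</PubMedId>"
                + "<PubMedId>8125298</PubMedId>"
                + "</Entrezgene>";
        System.out.println(extractPmids(xml)); // prints [8125298, 9373149]
    }
}
```

Using a set matters because the same article can be cited in several places within one Gene record, and each article should be downloaded only once.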
Source code
Compilation
javac BioStar4296.java
Test
java BioStar4296 ZC3H7B eif4G1 PRNP
<?xml version="1.0" encoding="UTF-8"?>
<experts>
<gene name="ZC3H7B" geneId="23264" count-pmids="13">
<Person>
<firstName>Sumio</firstName>
<lastName>Sugano</lastName>
<pmid>8125298</pmid>
<pmid>9373149</pmid>
<pmid>14702039</pmid>
<affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
<affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
<affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
</Person>
</gene>
<gene name="eif4G1" geneId="1981" count-pmids="106">
<Person>
<firstName>Nahum</firstName>
<lastName>Sonenberg</lastName>
<pmid>7651417</pmid>
<pmid>7935836</pmid>
<pmid>8449919</pmid>
(...)
<affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
<affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
<affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
(...)
</Person>
</gene>
<gene name="PRNP" geneId="5621" count-pmids="429">
<Person>
<firstName>John</firstName>
<lastName>Collinge</lastName>
<pmid>1352724</pmid>
<pmid>1677164</pmid>
<pmid>2159587</pmid>
<pmid>20583301</pmid>
(...)
<mail>j.collinge@ic.ac.uk</mail>
<affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
<affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk</affilitation>
<affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu</affilitation>
(...)
</Person>
</gene>
</experts>
About this result
- ZC3H7B: the result is wrong. In Dr Sugano's articles (3 articles), ZC3H7B was just one gene among a large set of other genes used in his studies. The expert would be Dr D. Poncet, my former thesis advisor, but he 'only' wrote two articles about this protein.
- Eif4G1: I know Dr Sonenberg is the expert. His e-mail wasn't found.
- PRNP: Collinge seems to be the expert. Dr Collinge's e-mail was detected.
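To make the ranking and e-mail detection behind these results more concrete, here is a rough sketch of how the author tally could work. It is a simplification and not the original code: the class AuthorTallySketch and its methods are hypothetical, authors are keyed on last name plus first name only (the initials check is left out), and an e-mail is assigned when the part before the '@' contains the author's last name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the original program.
final class AuthorTallySketch {

    /** One record per distinct (last name, first name) pair. */
    static final class Author {
        final String firstName;
        final String lastName;
        int count = 0;      // number of articles in which the author was seen
        String mail = null; // stays null when no matching e-mail is found

        Author(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    private final Map<String, Author> authors = new LinkedHashMap<>();

    /** Records one occurrence of an author in an article. */
    void see(String firstName, String lastName) {
        String key = (lastName + "|" + firstName).toLowerCase(Locale.ROOT);
        authors.computeIfAbsent(key, k -> new Author(firstName, lastName)).count++;
    }

    /** Assigns the e-mail to authors whose last name occurs in its local part. */
    void assignMail(String mail) {
        int at = mail.indexOf('@');
        if (at < 1) return; // not a usable address
        String local = mail.substring(0, at).toLowerCase(Locale.ROOT);
        for (Author a : authors.values()) {
            if (a.mail == null && local.contains(a.lastName.toLowerCase(Locale.ROOT))) {
                a.mail = mail.toLowerCase(Locale.ROOT);
            }
        }
    }

    /** Returns the most frequently seen author, or null if none was recorded. */
    Author mostProlific() {
        List<Author> sorted = new ArrayList<>(authors.values());
        // Stable sort: authors with equal counts keep their first-seen order.
        sorted.sort(Comparator.comparingInt((Author a) -> a.count).reversed());
        return sorted.isEmpty() ? null : sorted.get(0);
    }

    public static void main(String[] args) {
        AuthorTallySketch tally = new AuthorTallySketch();
        tally.see("Sumio", "Sugano");
        tally.see("John", "Collinge");
        tally.see("John", "Collinge");
        tally.assignMail("J.Collinge@ic.ac.uk");
        Author top = tally.mostProlific();
        // prints: John Collinge 2 j.collinge@ic.ac.uk
        System.out.println(top.firstName + " " + top.lastName + " "
                + top.count + " " + top.mail);
    }
}
```

A purely count-based ranking like this one explains the ZC3H7B result: an author of a few large-scale studies can outrank the person who actually works on the gene.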
That's it,
Pierre