Showing posts with label pubmed. Show all posts

27 May 2016

Pubmed: extracting the gender and location of the 1st authors who published in the Bioinformatics journal.

In this post I'll gather some statistics about the 1st authors of the papers published in the "Bioinformatics" journal, as indexed in pubmed: I'll extract their genders and locations.
I'll use some tools I described a few years ago, which I have since rewritten.

Downloading the data

To download the papers published in Bioinformatics, the pubmed/entrez query is '"Bioinformatics"[jour]'.
I use pubmeddump to download all those articles from pubmed as XML.
java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]'

Adding the authors' gender

PubmedGender is used to add two attributes, '@male' and/or '@female', to the Pubmed/XML '<Author>' element.
<Author ValidYN="Y" male="169">
  <LastName>Lindenbaum</LastName>
  <ForeName>Pierre</ForeName>

Adding the authors' location

PubmedMap is used to add some attributes to the Pubmed/XML '<Affiliation>' element.
<Author>
 <LastName>Lai</LastName>
 <ForeName>Chih-Cheng</ForeName>
 <Initials>CC</Initials>
 <AffiliationInfo>
  <Affiliation domain="tw" place="Taiwan">Department of Intensive Care Medicine, Chi Mei Medical Center, Liouying, Tainan, Taiwan.</Affiliation>

Extracting the data from XML as a table

I use SAXScript to extract the data from XML.
A SAX parser is an event-driven parser for XML. Here the events are handled by a simple javascript program.
The script below finds the sex, the year of publication and the location of the 1st author of each article and prints the results as a text table.
/** current text content */
var content=null;
/** author position in the article */
var count_authors=0;
/** current author */
var author=null;
/** in element <PubDate> */
var in_pubdate=false;
/** current year */
var year=null;

/** called when a new XML element is found */
function startElement(uri,localName,name,atts)
    {
    if(name=="PubDate")
        { in_pubdate=true; }
    else if(in_pubdate && name=="Year")
        { content=""; }
    else if(name=="Author" && count_authors==0)
        {
        content="";
        /** get sex */
        var male = atts.getValue("male");
        var female = atts.getValue("female");
        var gender = (male==null?(female==null?null:"F"):"M");
        /* both male & female ? get the highest score */
        if(male!=null && female!=null)
            {
            var fm = parseInt(male);
            var ff = parseInt(female);
            gender = (fm>ff?"M":"F");
            }
        if(gender!=null) author={"sex":gender,"year":year,"domain":null};
        }
    else if(author!=null && name=="Affiliation")
        {
        author.domain = atts.getValue("domain");
        }
    }

/** in a text node, append the text */
function characters(s)
    {
    if(content!=null) content+=s;
    }

/** end of XML element */
function endElement(uri,localName,name)
    {
    if(name=="PubDate") { in_pubdate=false; }
    else if(in_pubdate && name=="Year") { year=content; }
    else if(name=="PubmedArticle" || name=="PubmedBookArticle")
        {
        count_authors=0;
        author=null;
        year=null;
        in_pubdate=false;
        }
    else if(name=="Author")
        {
        count_authors++;
        /* print the first author */
        if(author!=null)
            {
            print(author.sex+"\t"+author.year+"\t"+author.domain);
            author=null;
            }
        }

    content=null;
    }

All in one

#download database of names
wget -O names.zip "https://www.ssa.gov/oact/babynames/names.zip" 
unzip -p names.zip yob2015.txt > names.csv
rm names.zip

java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]' |\
 java -jar jvarkit/dist/pubmedgender.jar  -d names.csv |\
 java -jar jvarkit/dist/pubmedmap.jar  |\
 java -jar src/jsandbox/dist/saxscript.jar -f pubmed.js > data.csv

The output (count, sex, year, country):
$ cat data.csv  | sort | uniq -c | sort -n
(...)
    105 M 2015 us
    107 M 2004 us
    107 M 2013 us
    115 M 2008 us
    117 M 2011 us
    120 M 2009 us
    122 M 2010 us
    126 M 2014 us
    130 M 2012 us
    139 M 2005 us

That's it, Pierre

21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "
I've recently discovered that pubmed now integrates ORCID identifiers.

There are several minor problems: I found some articles where the ORCID id is malformed, or where different people use the same ORCID-ID:







You can download the papers containing some ORCID identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I've used one of my tools, pubmeddump, to download the articles as XML, and I wrote PubmedOrcidGraph to extract the authors' ORCID IDs.
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
(...)
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into SQL statements.
<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) &lt; 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This stylesheet contains an extension function 'strcmp' for the xslt processor xalan to compare two strings.
This extension is used to make sure that the field "orcid1" in the table "collab" is always lower than "orcid2", to avoid duplicate pairs.
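The pair-ordering trick can be sketched in plain Java (a hypothetical illustration; in the pipeline itself the work is done inside the stylesheet):

```java
public class OrcidPair {
    // Return the two ORCID IDs in lexicographic order, so that the pair
    // (a,b) and the pair (b,a) always produce the same (orcid1,orcid2)
    // row, and the UNIQUE(orcid1,orcid2) constraint removes duplicates.
    static String[] canonical(String o1, String o2) {
        return o1.compareTo(o2) < 0
                ? new String[] { o1, o2 }
                : new String[] { o2, o1 };
    }

    public static void main(String[] args) {
        String[] p = canonical("0000-0002-4449-1863", "0000-0002-4078-7413");
        System.out.println(p[0] + " , " + p[1]);
    }
}
```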
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)
These SQL statements are then loaded into sqlite3:
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

sqlite3 -separator ' ' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e "s/'/\&apos;/g" -e 's/"/\&quot;/g' |\
 awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"



The output is saved and then loaded into gephi.






That's it,

Pierre

12 May 2014

Generating wikipedia semantic links from a pubmed-id

In "Building a biomedical semantic network in Wikipedia with Semantic Wiki Links" (Database. 2012 Mar 20;2012) Benjamin Good et al. introduced the Semantic Wiki Link (SWL):

An SWL is a hyperlink on Wikipedia that allows the editor to explicitly specify the type of relationship between the concept described on the page being edited and the concept that is being linked to (http://en.wikipedia.org/wiki/Template:SWL). These SWLs are implemented using MediaWiki templates.
(...)
any programmer can now write computer programs to parse Wikipedia content for SWLs and import them into third-party tools (e.g. triplestores, etc.)
Example: Phospholamban:
The protein encoded by this gene is found as a pentamer and is a major substrate for the cAMP-dependent protein kinase ({{SWL|type=substrate_for|target=protein kinase A|label=PKA}}) in cardiac muscle.




Using Entrez-Ajax (Loman et al.) and the Wikipedia API, I wrote an HTML+JS interface to accelerate the creation of an SWL wiki-text from a PUBMED-id:


and.. well, that's it,

Pierre

24 October 2013

PubMed Commons & Bioinformatics: a call for action


NCBI PubMed Commons/@PubMedCommons is a new system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others.
Now that we can add comments to the papers in pubmed, I suggest flagging the articles to mark deprecated software, databases and hyperlinks using a simple controlled syntax. Here are a few examples: the line starts with '!MOVED' or '!NOTFOUND' and is followed by a URL and/or a list of PMIDs and/or a quoted comment.

Examples

!MOVED: for http://www.ncbi.nlm.nih.gov/pubmed/8392714 (Rebase/1993) to http://www.ncbi.nlm.nih.gov/pubmed/19846593 (Rebase/2010)
!MOVED PMID:19846593 "A more recent version" 
In http://www.ncbi.nlm.nih.gov/pubmed/19919682 the URL has moved to http://biogps.org.
!MOVED <http://biogps.org> 
I moved the sources of http://www.ncbi.nlm.nih.gov/pubmed/9682060 to github
!MOVED <https://github.com/lindenb/cloneit/tree/master/c> 
!NOTFOUND: for http://www.ncbi.nlm.nih.gov/pubmed/9545460 ( Biocat EXCEL template ) url http://www.ebi.ac.uk/biocat/biocat.html returns a 404.
!NOTFOUND "The URL http://www.ebi.ac.uk/biocat/biocat.html was not found " 
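Such a controlled syntax would be easy to consume programmatically; a hypothetical sketch of a parser (the keyword and the PMID references are extracted with two regular expressions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagParser {
    private static final Pattern LINE = Pattern.compile("^!(MOVED|NOTFOUND)\\b.*$");
    private static final Pattern PMID = Pattern.compile("PMID:(\\d+)");

    // Return "MOVED"/"NOTFOUND", or null if the comment is not a flag.
    static String flagOf(String line) {
        Matcher m = LINE.matcher(line.trim());
        return m.matches() ? m.group(1) : null;
    }

    // Collect every PMID:xxxx reference found on the line.
    static List<String> pmidsOf(String line) {
        List<String> pmids = new ArrayList<>();
        Matcher m = PMID.matcher(line);
        while (m.find()) pmids.add(m.group(1));
        return pmids;
    }

    public static void main(String[] args) {
        String s = "!MOVED PMID:19846593 \"A more recent version\"";
        System.out.println(flagOf(s) + " " + pmidsOf(s));
    }
}
```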


That's it,

Pierre

15 June 2013

Dear editors,...

Here is a screenshot of Thunderbird with the RSS feeds for "NCBI/pubmed: Exome Sequencing".


I'm pretty sure that all that (semantic) information is lost.

Dear editors, could you please ask the authors to complete or create an article in wikipedia about their paper once you have published it (possibly using a semantic template ?).

Thank you,

Pierre

See also We found a gene involved in a genetic disease. Now, what is the TODO list ?

25 March 2013

Embedding Pubmed, Graphviz and a remote image in #LaTeX. My notebook.

I'm learning LaTeX. Today I learned how to create a new command in LaTeX.

\newcommand{name}[num]{definition}
"Basically the command requires two arguments: the name of the command you want to create, and the definition of the command". I played with LaTeX and wrote the following three commands:

Embedding a remote picture

The following LaTeX document defines a new command "\remoteimage". It takes 3 parameters: a filename, a URL and some options for \includegraphics. If the file doesn't exist, the URL is downloaded and saved to the file. The downloaded image is then included in the final LaTeX document.
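The LaTeX source isn't embedded in this archived copy; a minimal reconstruction of such a command could look like this (a hypothetical sketch, not the original document):

```latex
\newcommand{\remoteimage}[3]{%
  % #1 = local filename, #2 = URL, #3 = options for \includegraphics
  \IfFileExists{#1}{}{%
    \immediate\write18{wget -O "#1" "#2"}% needs shell escape enabled
  }%
  \includegraphics[#3]{#1}%
}
```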

Note: latex files must be compiled with --enable-write18 to enable system-calls.
pdflatex --enable-write18 input.tex
Result:

External Image /Latex by lindenb


GraphViz Dot

The second LaTeX document works the same way. It defines a command "\graphviz", sends the content of the 2nd argument to graphviz dot and saves the resulting image before importing it into the LaTeX document.

Result:

GraphViz / Latex by lindenb


Pubmed

The last command defines "\pmid". It needs one Pubmed identifier. It downloads the XML record for this pmid and transforms it to LaTeX with xsltproc and the following XSLT stylesheet:

The LaTeX document includes four pubmed identifiers:

Result:

Pubmed / Latex by lindenb






That's it,

Pierre




14 October 2012

Calculating time from submission to publication / Degree of burden in submitting a paper

After "404 not found": a database of non-functional resources in the NAR database collection, I've uploaded my second dataset on figshare:
Calculating time from submission to publication / Degree of burden in submitting a paper.

Calculating time from submission to publication / Degree of burden in submitting a paper. Pierre Lindenbaum,  Ryan Delahanty.
figshare.
Retrieved 10:13, Oct 14, 2012 (GMT)
http://dx.doi.org/10.6084/m9.figshare.96403

This dataset was inspired by this post on biostar, initially asked by Ryan Delahanty: I was wondering if it would be possible to calculate some kind of a metric for the speed-of-publication for each journal. I'm not sure submitted and accepted dates are available for all papers, but I noticed in the XML data there are fields like the following:
<PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2011</Year>
                <Month>11</Month>
                <Day>29</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2011</Year>
                <Month>12</Month>
                <Day>20</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
           (...)

In this dataset, the script 'pubmed.sh' downloads the journals from http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.pubmedhelptable45/ and the 'eigenfactors' from http://www.eigenfactor.org.

For each journal, it scans pubmed (starting from year 2000) and gets the difference between the date[@PubStatus='received'] and the date[@PubStatus='accepted'].
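The date arithmetic itself is trivial; for instance with java.time (a sketch, not the actual code of 'pubmed.sh'):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class SubmissionDelay {
    // Days elapsed between the 'received' and the 'accepted'
    // PubMedPubDate elements of the <History> block.
    static long days(LocalDate received, LocalDate accepted) {
        return ChronoUnit.DAYS.between(received, accepted);
    }

    public static void main(String[] args) {
        // the XML example above: received 2011-11-29, accepted 2011-12-20
        System.out.println(days(LocalDate.of(2011, 11, 29),
                                LocalDate.of(2011, 12, 20))); // 21
    }
}
```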

title | issn | eigenfactor | days
"Acta biochimica Polonica" | 0001-527X | 0.003996 | 119.770935960591
"Acta biomaterialia" | 1742-7061 | 0.021521 | 29.682692307692
"Acta biotheoretica" | 0001-5342 | 0.000844 | 161.897058823529
"Acta cirurgica brasileira / Sociedade Brasileira para Desenvolvimento Pesquisa em Cirurgia" | 0102-8650 | 0.001281 | 22.038461538462
"Acta cytologica" | 0001-5547 | 0.002305 | 65.3006134969325
"Acta diabetologica" | 0940-5429 | 0.001851 | 299.6
"Acta haematologica" | 0001-5792 | 0.002825 | 118.654676258993
"Acta histochemica" | 0065-1281 | 0.002162 | 110.471204188482
"Acta histochemica et cytochemica" | 0044-5991 | 0.000677 | 81.6455696202532
"Acta neurochirurgica" | 0001-6268 | 0.009685 | 204.371830985916
"Acta neuropathologica" | 0001-6322 | 0.023471 | 69.7277882797732
"Acta theriologica" | 0001-7051 | 0.000901 | 147.0
"Acta tropica" | 0001-706X | 0.010111 | 96.577777777778
"Acta veterinaria Scandinavica" | 0044-605X | 0.001612 | 82.0
"Addictive behaviors" | 0306-4603 | 0.017915 | 163.049731182796
"Advances in space research" | 0273-1177 | 0.021217 | 205.0
Ambio | 0044-7447 | 0.007463 | 181.878048780488
"American journal of human genetics" | 0002-9297 | 0.120156 | 67.1898928024502
"American journal of hypertension" | 0895-7061 | 0.017359 | 104.074576271186
(....)

Here is the kind of figure I got:

As far as I remember, "Cell" is the point having the highest eigenfactor.


Note: pubmed contains some errors: e.g. received > accepted (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20591334&retmode=xml) or some dates in the future: ( http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12921703&retmode=xml )


That's it,

Pierre

04 June 2011

Pubmed: sorting the articles on the number of times they've been cited

In 2008 I used www.eigenfactor.org to sort a set of Pubmed articles on the impact factor of the journal. In the current post I will show how I've used NCBI ELink to sort the articles on the number of times they have been cited by other articles in pubmed-central.

The NCBI ELink API checks for the existence of an external or Related Articles link from a list of one or more primary IDs. It can be used to retrieve the articles in pubmed central citing a given PMID.
For example, the following URI: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=xml&dbfrom=pubmed&id=19755503&cmd=neighbor returns the 3 articles that cited the Gene Wiki paper:

(...) <LinkSetDb>
<DbTo>pubmed</DbTo>
<LinkName>pubmed_pubmed_citedin</LinkName>
<Link>
<Id>21516242</Id>
</Link>
<Link>
<Id>21062808</Id>
</Link>
<Link>
<Id>20334642</Id>
</Link>
</LinkSetDb>
(...)

I wrote a java program using this resource to sort the articles on the number of times they have been cited. The program is available on github.
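The counting step can be sketched with a StAX parser (a hypothetical re-implementation, not the actual source of the tool):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CitedInCount {
    // Count the <Id> elements under the 'pubmed_pubmed_citedin'
    // <LinkSetDb> of an ELink response.
    static int count(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            boolean citedIn = false;
            StringBuilder text = new StringBuilder();
            int n = 0;
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        text.setLength(0);
                        if (r.getLocalName().equals("LinkSetDb")) citedIn = false;
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (r.getLocalName().equals("LinkName"))
                            citedIn = text.toString().equals("pubmed_pubmed_citedin");
                        else if (citedIn && r.getLocalName().equals("Id"))
                            n++;
                        break;
                }
            }
            return n;
        } catch (XMLStreamException err) {
            return -1; // malformed XML
        }
    }

    public static void main(String[] args) {
        String xml = "<LinkSetDb><DbTo>pubmed</DbTo>"
                + "<LinkName>pubmed_pubmed_citedin</LinkName>"
                + "<Link><Id>21516242</Id></Link>"
                + "<Link><Id>21062808</Id></Link>"
                + "<Link><Id>20334642</Id></Link></LinkSetDb>";
        System.out.println(count(xml)); // 3
    }
}
```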

Example

Let's sort the articles published in the 2005 NAR-Database Issue:
java -jar dist/pubmedsortbycitations.jar -c -L ALL -e '"Nucleic Acids Res"[JOUR] "Database issue"[ISS] 2005[PDAT]' > sorted.xml
OR
java -jar dist/pubmedsortbycitations.jar -c -L ALL pubmed_result_saved_as.xml > sorted.xml

The output is a sorted set of XML pubmed records.
The most cited article (290 citations) is The Universal Protein Resource (UniProt).
Some articles have never been cited: e.g.: Metagrowth: a new resource for the building of metabolic hypotheses in microbiology.

The '-c' option in the command line makes the program insert a new XML node containing the PMIDs of the articles citing each article:
(...)
<ArticleId IdType="pubmed">15608167</ArticleId>
<ArticleId IdType="pmc">PMC540024</ArticleId>
</ArticleIdList>
</PubmedData>
<CitedBy count="290">
<PMID>15608199</PMID>
<PMID>15608238</PMID>
<PMID>15608243</PMID>
<PMID>15769290</PMID>
<PMID>15888679</PMID>
<PMID>15980452</PMID>
(...)
<PMID>21450054</PMID>
<PMID>21453542</PMID>
<PMID>21544166</PMID>
</CitedBy>
</PubmedArticle>



That's it,

Pierre

28 April 2011

dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions

People from the "Human Genetics Center" in Houston have compiled a new resource named dbNSFP, described in http://www.ncbi.nlm.nih.gov/pubmed/21520341.

Hum Mutat. 2011 Apr 21. doi:10.1002/humu.21517.
dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions.
Liu X, Jian X, Boerwinkle E.


They have compiled the "prediction scores from four new and popular algorithms (SIFT, Polyphen2, LRT and MutationTaster), along with a conservation score (PhyloP) and other related information, for every potential NS in the human genome (a total of 75,931,005)." .

So, you don't have to submit new jobs to SIFT or Polyphen: everything has already been calculated and joined here.

The database is available from http://sites.google.com/site/jpopgen/dbNSFP.

Downloading

lindenb@yokofakun:~$ wget "http://dl.dropbox.com/u/17001647/dbNSFP/dbNSFP.chr1-22XY.zip"
--2011-04-27 13:50:26-- http://dl.dropbox.com/u/17001647/dbNSFP/dbNSFP.chr1-22XY.zip
Proxy request sent, awaiting response... 200 OK
Length: 1200703405 (1.1G) [application/zip]
Saving to: `dbNSFP.chr1-22XY.zip'

100%[=================================================================================================================>] 1,200,703,405 1.82M/s in 10m 11s

2011-04-27 14:00:38 (1.87 MB/s) - `dbNSFP.chr1-22XY.zip' saved [1200703405/1200703405]

Content

unzip -t dbNSFP.chr1-22XY.zip
Archive: dbNSFP.chr1-22XY.zip
testing: dbNSFP.chr1 OK
testing: dbNSFP.chr10 OK
testing: dbNSFP.chr11 OK
testing: dbNSFP.chr12 OK
testing: dbNSFP.chr13 OK
testing: dbNSFP.chr14 OK
testing: dbNSFP.chr15 OK
testing: dbNSFP.chr16 OK
testing: dbNSFP.chr17 OK
testing: dbNSFP.chr18 OK
testing: dbNSFP.chr19 OK
testing: dbNSFP.chr2 OK
testing: dbNSFP.chr20 OK
testing: dbNSFP.chr21 OK
testing: dbNSFP.chr22 OK
testing: dbNSFP.chr3 OK
testing: dbNSFP.chr4 OK
testing: dbNSFP.chr5 OK
testing: dbNSFP.chr6 OK
testing: dbNSFP.chr7 OK
testing: dbNSFP.chr8 OK
testing: dbNSFP.chr9 OK
testing: dbNSFP.chrX OK
testing: dbNSFP.chrY OK

Sample (verticalized)

>>2
$1 #chr : 22
$2 pos(1-based) : 15453440
$3 ref : T
$4 alt : G
$5 aaref : M
$6 aaalt : L
$7 hg19pos(1-based) : 17073440
$8 genename : CCT8L2
$9 geneid : 150160
$10 CCDSid : CCDS13738.1
$11 refcodon : ATG
$12 codonpos : 1
$13 fold-degenerate : 0
$14 aapos : 1
$15 cds_strand : -
$16 LRT_Omega : 1.116940
$17 PhyloP_score : 0.963611
$18 PlyloP_pred : C
$19 SIFT_score : 1.0
$20 SIFT_pred : D
$21 Polyphen2_score : 0.25
$22 Polyphen2_pred : P
$23 LRT_score : 0.419288
$24 LRT_pred : U
$25 MutationTaster_score : 1.0
$26 MutationTaster_pred : D
<<2
>>3
$1 #chr : 22
$2 pos(1-based) : 15453440
$3 ref : T
$4 alt : C
$5 aaref : M
$6 aaalt : V
$7 hg19pos(1-based) : 17073440
$8 genename : CCT8L2
$9 geneid : 150160
$10 CCDSid : CCDS13738.1
$11 refcodon : ATG
$12 codonpos : 1
$13 fold-degenerate : 0
$14 aapos : 1
$15 cds_strand : -
$16 LRT_Omega : 1.116940
$17 PhyloP_score : 0.963611
$18 PlyloP_pred : C
$19 SIFT_score : 1.0
$20 SIFT_pred : D
$21 Polyphen2_score : 0.25
$22 Polyphen2_pred : P
$23 LRT_score : 0.419288
$24 LRT_pred : U
$25 MutationTaster_score : 1.0
$26 MutationTaster_pred : D
<<3
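Since the files are plain tab-delimited text, they can be queried with a few lines of code; a hypothetical sketch (column order as in the sample above: '#chr' then 'pos(1-based)'):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class DbNsfpGrep {
    // Print every dbNSFP row matching a chromosome and a 1-based
    // position; return the number of matching rows.
    static int grep(Reader src, String chr, String pos) {
        int hits = 0;
        try (BufferedReader in = new BufferedReader(src)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("#")) continue; // header line
                String[] tokens = line.split("\t");
                if (tokens[0].equals(chr) && tokens[1].equals(pos)) {
                    System.out.println(line);
                    hits++;
                }
            }
        } catch (IOException err) {
            throw new RuntimeException(err);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "#chr\tpos(1-based)\tref\talt\n"
                + "22\t15453440\tT\tG\n"
                + "22\t15453440\tT\tC\n"
                + "22\t15453441\tA\tG\n";
        System.out.println(grep(new StringReader(sample), "22", "15453440") + " hit(s)");
    }
}
```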


That's it,

Pierre

15 April 2011

"404 not found": An update for "bioinformatics/cabios"

Yesterday, I blogged about the persistence of the URLs present in the abstracts of NAR. Today, I've updated my tool and used it to scan the abstracts of the following pubmed query: "Bioinformatics"[JOUR] or "Comput Appl Biosci"[JOUR].
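The liveness test behind these numbers can be sketched like this (a hypothetical version; my actual tool may differ):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlAlive {
    // Consider a URL "alive" when a HEAD request answers with a 2xx or
    // 3xx status; anything else (404, timeout, bad DNS, malformed
    // URL...) counts as dead.
    static boolean alive(String url) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("HEAD");
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            int code = con.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Note that, as the post says below, a URL that answers is not proof that the service it described is still maintained.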

Here is the result:

Year | Total | Alive | %
(no year) | 18 | 1 | 5
1995 | 1 | 0 | 0
1996 | 9 | 3 | 33
1997 | 13 | 3 | 23
1998 | 86 | 19 | 22
1999 | 70 | 17 | 24
2000 | 83 | 25 | 30
2001 | 110 | 64 | 58
2002 | 121 | 78 | 64
2003 | 284 | 170 | 59
2004 | 402 | 257 | 63
2005 | 495 | 359 | 72
2006 | 374 | 297 | 79
2007 | 448 | 381 | 85
2008 | 466 | 415 | 89
2009 | 507 | 462 | 91
2010 | 605 | 566 | 93
2011 | 283 | 268 | 94


Again, even if we can reach a web site, it doesn't mean that the service described in an article is still available or maintained.

As suggested by Egon Willighagen, I've uploaded the RDF output of my program on figshare: http://figshare.com/figures/index.php/Bioinformatics.404_20110415.rdf.

That's it,

Pierre

14 December 2010

Looking for an expert ?

Yesterday, Andrew Su asked on Biostar: "Given a gene, what is the best automated method to identify the world experts?".

Here is my solution:

  • First, for a given gene name, we use NCBI-ESearch to find its Gene-Id in NCBI Gene
  • The Gene record is then downloaded as XML using NCBI-EFetch
  • XPATH is used to retrieve all the articles in pubmed about this gene, identified by the XML tags <PubMedId>
  • Each article is downloaded from pubmed. The element <Affiliation> is extracted from the record; sometimes this tag contains the main contact's email. The authors are also extracted and we count the number of times each author was found. I tried to solve the ambiguity of authors' names by looking at the name, surname and initials. If an author's name was contained in an e-mail, the e-mail was assigned to that author
  • At the end, all the authors are sorted by the number of times they were seen and the most prolific author is printed out.
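The last two steps boil down to counting occurrences in a map; a minimal sketch (names here are hypothetical, not the actual code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpertFinder {
    // Count how many times each author appears in the pubmed records
    // and return the most frequent one: the putative "expert".
    static String topAuthor(List<String> authors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String a : authors) counts.merge(a, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        System.out.println(topAuthor(Arrays.asList(
                "Sonenberg N", "Smith J", "Sonenberg N"))); // Sonenberg N
    }
}
```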


Source code


Compilation

javac BioStar4296.java

Test

java BioStar4296 ZC3H7B eif4G1 PRNP

<?xml version="1.0" encoding="UTF-8"?>
<experts>
<gene name="ZC3H7B" geneId="23264" count-pmids="13">
<Person>
<firstName>Sumio</firstName>
<lastName>Sugano</lastName>
<pmid>8125298</pmid>
<pmid>9373149</pmid>
<pmid>14702039</pmid>
<affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
<affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
<affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
</Person>
</gene>
<gene name="eif4G1" geneId="1981" count-pmids="106">
<Person>
<firstName>Nahum</firstName>
<lastName>Sonenberg</lastName>
<pmid>7651417</pmid>
<pmid>7935836</pmid>
<pmid>8449919</pmid>
(...)
<affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
<affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
<affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
(...)
</Person>
</gene>
<gene name="PRNP" geneId="5621" count-pmids="429">
<Person>
<firstName>John</firstName>
<lastName>Collinge</lastName>
<pmid>1352724</pmid>
<pmid>1677164</pmid>
<pmid>2159587</pmid>
<pmid>20583301</pmid>
(...)
<mail>j.collinge@ic.ac.uk</mail>
<affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
<affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. J.Collinge@ic.ac.uk</affilitation>
<affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. sratte@pitt.edu</affilitation>
(...)
</Person>
</gene>
</experts>

About this result


  • ZC3H7B: the result is wrong. In Dr Sugano's articles (3 articles) ZC3H7B was present among a large set of other genes used in his studies. The expert would be Dr D. Poncet, my former thesis advisor, but he 'only' wrote two articles about this protein.
  • Eif4G1: I know Dr Sonenberg is the expert. His email wasn't found.
  • PRNP: Collinge seems to be the expert. Dr Collinge's e-mail was detected.


That's it,

Pierre

11 October 2010

Playing with the Wordle algorithm: a tag cloud of Mesh Terms

The paper describing Wordle has recently been published (http://www.research.ibm.com/visual/papers/wordle_final2.pdf). The algorithm is briefly described: “The most distinctive geometric aspect of a Wordle is the layout algorithm, which packs words to make efficient use of space. While many space-filling visualizations exist, they typically work by recursive (...) Layout proceeds according to this pseudocode:

sort words by weight, decreasing
for each word w:
w.position := makeInitialPosition(w);
while w intersects other words:
updatePosition(w);

The two key procedures here are "makeInitialPosition" and "updatePosition". The makeInitialPosition routine picks a point at random according to a distribution that takes into account the desired overall shape, and, if desired, alphabetical order. The updatePosition routine moves the word on a spiral of increasing radius, radiating from the word's starting position. The updatePosition routine is aware of constraints on the overall shape of the Wordle. Constraining the layout to a rectangular shape causes updatePosition to prefer positions inside of the strict boundaries of the playing field; a blobby overall shape accepts boundary violations. The rectangular constraint is relaxed when the spiral radius exceeds either playing field dimension. ”
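The "spiral of increasing radius" can be sketched as follows (a hypothetical illustration, not the code from the paper):

```java
public class Spiral {
    // Candidate position number 'step' on a spiral radiating from
    // (x0,y0): the angle advances a fixed amount per step while the
    // radius slowly grows.
    static double[] position(double x0, double y0, int step) {
        double angle = step * 0.5;        // radians per step (arbitrary)
        double radius = 1.0 + step * 0.2; // increases with each step
        return new double[] { x0 + radius * Math.cos(angle),
                              y0 + radius * Math.sin(angle) };
    }

    public static void main(String[] args) {
        // try successive candidate positions until no intersection remains
        for (int step = 0; step < 3; step++) {
            double[] p = position(100, 100, step);
            System.out.printf("step %d -> (%.2f, %.2f)%n", step, p[0], p[1]);
        }
    }
}
```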


Just for fun, I've implemented my own version of the Wordle algorithm. The Java code is available on github at http://github.com/lindenb/jsandbox/blob/master/src/sandbox/MyWordle.java. I won't describe the program here; I'll just say that the code invokes a java.awt.font.TextLayout class to get the shape of the text:
Graphics2D g=(...)
FontRenderContext frc = g.getFontRenderContext();
Font font=new Font("Dialog",Font.BOLD,fontSize);
TextLayout textLayout=new TextLayout(w.getText(), font, frc);
Shape shape=textLayout.getOutline(null);
and a java.awt.geom.Area to test if two shapes intersect.
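The intersection test itself is only a few lines; a sketch using plain rectangles instead of glyph outlines:

```java
import java.awt.Shape;
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

public class ShapeOverlap {
    // True if the two shapes overlap: intersect their Areas and
    // check whether anything is left.
    static boolean intersects(Shape a, Shape b) {
        Area area = new Area(a);
        area.intersect(new Area(b));
        return !area.isEmpty();
    }

    public static void main(String[] args) {
        Shape r1 = new Rectangle2D.Double(0, 0, 10, 10);
        Shape r2 = new Rectangle2D.Double(5, 5, 10, 10);
        Shape r3 = new Rectangle2D.Double(20, 20, 5, 5);
        System.out.println(intersects(r1, r2)); // true
        System.out.println(intersects(r1, r3)); // false
    }
}
```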

Ok, let's test this code. First I'm going to dump a pubmed query as XML with another simple tool named PubmedDump. This XML file is then parsed with a javascript program called from the SAX parser saxscript.jar (previously described here):
mesh.js
importPackage(Packages.sandbox);
importPackage(Packages.java.io);
importPackage(Packages.java.awt);
var content=null;
var mesh2count={};

function startElement(uri,localName,name,atts)
    {
    if(name=="DescriptorName")
        {
        content="";
        }
    }

function characters(s)
    {
    if(content!=null) content+=s;
    }

function endElement(uri,localName,name)
    {
    if(content!=null)
        {
        var count=mesh2count[content];
        if(count===undefined) count=0;
        mesh2count[content]=count+1;
        }
    content=null;
    }

function endDocument()
    {
    var w= new MyWordle();
    for(var s in mesh2count)
        {
        var word= new MyWordle.Word(s,mesh2count[s]);
        w.add(word);
        }
    w.setUseArea(true);    /* use shape area instead of bounding boxes */
    w.setAllowRotate(true);
    w.setSortType(1);      /* sort by weight */
    w.doLayout();

    var f=new File("result.svg");
    w.saveAsSVG(f);
    }
This javascript code counts the occurrence of each MESH term (<DescriptorName>) and, once the document has been parsed, a new instance of the MyWordle class is created and filled, and the result is saved to a SVG file.

Invocation


Here is an example for the query "Rotavirus NSP3 NSP1":
java -jar pubmeddump.jar "Rotavirus NSP3 NSP1" |\
java -cp mywordle.jar:saxscript.jar org.lindenb.tinytools.SAXScript -n -f mesh.js


Result





That's it,
Pierre

22 September 2010

A Simple tool to get the sex ratio in pubmed.

Just for fun, I wrote a simple java tool to get the sex ratio of the authors in Pubmed. The program fetches a list of names/genders I found in the following perl module: http://cpansearch.perl.org/src/EDALY/Text-GenderFromName-0.33/GenderFromName.pm. The source code is given below.

(In the following examples, the many names that couldn't be associated with a gender were ignored).

Bioinformatics


Here is the result for "Bioinformatics[journal]"
Women: 3178 (19%) Men: 13149 (80%)
Bioinformatics[Journal]


The 'Lancet' in 2009

Women: 579 (30%) Men: 1331 (69%)
Lancet[Journal] 2009[Date]


Nature in 2009

Women: 1616 (30%) Men: 3768 (69%)
Nature[Journal] 2009[Date]


Nursing in 2009

Women: 29 (70%) Men: 12 (29%)
Nursing[Journal] 2009[Date]



Articles about Charles Darwin

Women: 25 (17%) Men: 118 (82%)
"Darwin C"[PS]



etc... etc..

Source code

/**
 * Author:
 *   Pierre Lindenbaum PhD
 *   plindenbaum@yahoo.fr
 * Source of data:
 *   http://cpansearch.perl.org/src/EDALY/Text-GenderFromName-0.33/GenderFromName.pm
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.text.Collator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.stream.events.XMLEvent;

/**
 * PubmedGender
 */
public class PubmedGender
    {
    private Map<String,Float> males=null;
    private Map<String,Float> females=null;
    private int limit=1000;
    private String query="";
    private int canvasSize=200;
    private boolean ignoreUndefined=false;

    private PubmedGender()
        {
        /* case-insensitive comparator: "Pierre" and "pierre" map to the same key */
        Collator collator= Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY);
        this.males=new TreeMap<String, Float>(collator);
        this.females=new TreeMap<String, Float>(collator);
        }

    /** download GenderFromName.pm and parse the '$Males' and '$Females'
     *  hash tables (name => frequency) */
    private void loadNames() throws IOException
        {
        BufferedReader in=new BufferedReader(new InputStreamReader(new URL("http://cpansearch.perl.org/src/EDALY/Text-GenderFromName-0.33/GenderFromName.pm").openStream()));
        String line;
        Map<String,Float> map=null;
        int posAssign=-1;
        while((line=in.readLine())!=null)
            {
            if(line.startsWith("$Males = {"))
                {
                map=this.males;
                }
            else if(line.startsWith("$Females = {"))
                {
                map=this.females;
                }
            else if(line.contains("}"))
                {
                map=null;
                }
            else if(map!=null && ((posAssign=line.indexOf("=>"))!=-1))
                {
                String name=line.substring(0,posAssign).replaceAll("'","").toLowerCase().trim();
                Float freq=Float.parseFloat(line.substring(posAssign+2).replaceAll("[',]","").toLowerCase().trim());
                map.put(name, freq);
                }
            else
                {
                map=null;
                }
            }
        in.close();
        }

    /** create a new StAX reader for the given URL */
    private XMLEventReader newReader(URL url) throws IOException,XMLStreamException
        {
        XMLInputFactory f= XMLInputFactory.newInstance();
        f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE,Boolean.FALSE);
        f.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES,Boolean.TRUE);
        f.setProperty(XMLInputFactory.IS_VALIDATING,Boolean.FALSE);
        f.setProperty(XMLInputFactory.SUPPORT_DTD,Boolean.FALSE);
        XMLEventReader reader=f.createXMLEventReader(url.openStream());
        return reader;
        }

    private void run() throws Exception
        {
        int countMales=0;
        int countFemales=0;
        int countUnknown=0;

        /* esearch: get the QueryKey/WebEnv for the query and count the PMIDs */
        URL url= new URL(
            "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="+
            URLEncoder.encode(this.query, "UTF-8")+
            "&retstart=0&retmax="+this.limit+"&usehistory=y&retmode=xml&email=plindenbaum_at_yahoo.fr&tool=gender");

        XMLEventReader reader= newReader(url);
        XMLEvent evt;
        String QueryKey=null;
        String WebEnv=null;
        int countId=0;
        while(!(evt=reader.nextEvent()).isEndDocument())
            {
            if(!evt.isStartElement()) continue;
            String tag= evt.asStartElement().getName().getLocalPart();
            if(tag.equals("QueryKey"))
                {
                QueryKey= reader.getElementText().trim();
                }
            else if(tag.equals("WebEnv"))
                {
                WebEnv= reader.getElementText().trim();
                }
            else if(tag.equals("Id"))
                {
                ++countId;
                }
            }
        reader.close();

        if(countId!=0)
            {
            /* efetch: download the articles and inspect each <Author> */
            url= new URL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv="+
                URLEncoder.encode(WebEnv,"UTF-8")+
                "&query_key="+URLEncoder.encode(QueryKey,"UTF-8")+
                "&retmode=xml&retmax="+this.limit+"&email=plindenbaum_at_yahoo.fr&tool=mail");

            reader= newReader(url);

            while(reader.hasNext())
                {
                evt=reader.nextEvent();
                if(!evt.isStartElement()) continue;
                if(!evt.asStartElement().getName().getLocalPart().equals("Author")) continue;
                String firstName=null;
                String initials=null;

                while(reader.hasNext())
                    {
                    evt=reader.nextEvent();
                    if(evt.isStartElement())
                        {
                        String localName=evt.asStartElement().getName().getLocalPart();
                        if(localName.equals("ForeName") || localName.equals("FirstName"))
                            {
                            firstName=reader.getElementText().toLowerCase();
                            }
                        else if(localName.equals("Initials"))
                            {
                            initials=reader.getElementText().toLowerCase();
                            }
                        }
                    else if(evt.isEndElement())
                        {
                        if(evt.asEndElement().getName().getLocalPart().equals("Author")) break;
                        }
                    }
                if( firstName==null ) continue;
                if( firstName.length()==1 ||
                    firstName.equals(initials)) continue;

                /* keep the longest token of the first name, e.g. "anna s" -> "anna" */
                String tokens[]=firstName.split("[ ]+");
                firstName="";
                for(String s:tokens)
                    {
                    if(s.length()> firstName.length())
                        {
                        firstName=s;
                        }
                    }

                if( firstName.length()==1 ||
                    firstName.equals(initials)) continue;

                /* look the name up in both tables; the higher frequency wins */
                Float male= this.males.get(firstName);
                Float female= this.females.get(firstName);

                if(male==null && female==null)
                    {
                    //System.err.println("Undefined "+firstName+" / "+lastName);
                    countUnknown++;
                    }
                else if(male!=null && female==null)
                    {
                    countMales++;
                    }
                else if(male==null && female!=null)
                    {
                    countFemales++;
                    }
                else if(male < female)
                    {
                    countFemales++;
                    }
                else if(female < male)
                    {
                    countMales++;
                    }
                else
                    {
                    //System.err.println("Undefined "+firstName+" / "+lastName);
                    countUnknown++;
                    }
                }
            reader.close();
            }
        if(ignoreUndefined) countUnknown=0;

        float total= countMales+countFemales+countUnknown;

        /* angles of the two pie slices */
        double radMale=(countMales/total)*Math.PI*2.0;
        double radFemale=(countFemales/total)*Math.PI*2.0;
        int radius= (canvasSize-2)/2;
        String id= "ctx"+System.currentTimeMillis()+""+(int)(Math.random()*1000);
        /* write the result as an HTML fragment drawing a pie chart in a <canvas> */
        XMLOutputFactory xmlfactory= XMLOutputFactory.newInstance();
        XMLStreamWriter w= xmlfactory.createXMLStreamWriter(System.out,"UTF-8");
        w.writeStartElement("html");
        w.writeStartElement("body");
        w.writeStartElement("div");
        w.writeAttribute("style","margin:10px;padding:10px;text-align:center;");
        w.writeStartElement("div");
        w.writeEmptyElement("canvas");
        w.writeAttribute("width", String.valueOf(canvasSize+1));
        w.writeAttribute("height", String.valueOf(canvasSize+1));
        w.writeAttribute("id", id);
        w.writeStartElement("script");
        w.writeCharacters(
            "function paint"+id+"(){var canvas=document.getElementById('"+id+"');"+
            "if (!canvas.getContext) return;var c=canvas.getContext('2d');"+
            "c.fillStyle='white';c.strokeStyle='black';"+
            "c.fillRect(0,0,"+canvasSize+","+canvasSize+");"+
            "c.fillStyle='gray';c.beginPath();c.arc("+(canvasSize/2)+","+(canvasSize/2)+","+radius+",0,Math.PI*2,true);c.fill();c.stroke();"+
            "c.fillStyle='blue';c.beginPath();c.moveTo("+(canvasSize/2)+","+(canvasSize/2)+");c.arc("+(canvasSize/2)+","+(canvasSize/2)+","+radius+",0,"+radMale+",false);c.closePath();c.fill();c.stroke();"+
            "c.fillStyle='pink';c.beginPath();c.moveTo("+(canvasSize/2)+","+(canvasSize/2)+");c.arc("+(canvasSize/2)+","+(canvasSize/2)+","+radius+","+radMale+","+(radMale+radFemale)+",false);c.closePath();c.fill();c.stroke();}"+
            "window.addEventListener('load',function(){ paint"+id+"(); },true);"
            );
        w.writeEndElement();
        w.writeEndElement();

        w.writeStartElement("span");
        w.writeAttribute("style","color:pink;");
        w.writeCharacters("Women: "+countFemales+" ("+(int)((countFemales/total)*100.0)+"%)");
        w.writeEndElement();
        w.writeCharacters(" ");
        w.writeStartElement("span");
        w.writeAttribute("style","color:blue;");
        w.writeCharacters("Men: "+countMales+" ("+(int)((countMales/total)*100.0)+"%)");
        w.writeEndElement();
        w.writeCharacters(" ");

        if(!this.ignoreUndefined)
            {
            w.writeStartElement("span");
            w.writeAttribute("style","color:gray;");
            w.writeCharacters("Undefined : "+countUnknown+" ("+(int)((countUnknown/total)*100.0)+"%)");
            w.writeEndElement();
            }
        w.writeEmptyElement("br");

        w.writeStartElement("a");
        w.writeAttribute("target","_blank");
        w.writeAttribute("href","http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term="+URLEncoder.encode(this.query,"UTF-8"));
        w.writeCharacters(this.query);
        w.writeEndElement();

        w.writeEndElement();
        w.writeEndElement();
        w.writeEndElement();
        w.flush();
        w.close();
        }

    public static void main(String[] args)
        {
        try
            {
            PubmedGender app=new PubmedGender();

            int optind=0;
            while(optind< args.length)
                {
                if(args[optind].equals("-h") ||
                   args[optind].equals("-help") ||
                   args[optind].equals("--help"))
                    {
                    System.err.println("Options:");
                    System.err.println(" -h help; This screen.");
                    System.err.println(" -w <int> canvas size default:"+app.canvasSize);
                    System.err.println(" -L <int> limit number default:"+app.limit);
                    System.err.println(" -i ignore undefined default:"+app.ignoreUndefined);
                    System.err.println(" query terms...");
                    return;
                    }
                else if(args[optind].equals("-L"))
                    {
                    app.limit=Integer.parseInt(args[++optind]);
                    }
                else if(args[optind].equals("-w"))
                    {
                    app.canvasSize=Integer.parseInt(args[++optind]);
                    }
                else if(args[optind].equals("-i"))
                    {
                    app.ignoreUndefined=true;
                    }
                else if(args[optind].equals("--"))
                    {
                    optind++;
                    break;
                    }
                else if(args[optind].startsWith("-"))
                    {
                    System.err.println("Unknown option "+args[optind]);
                    return;
                    }
                else
                    {
                    break;
                    }
                ++optind;
                }
            if(optind==args.length)
                {
                System.err.println("Query missing");
                return;
                }
            app.query="";
            while(optind< args.length)
                {
                if(!app.query.isEmpty()) app.query+=" ";
                app.query+=args[optind++];
                }
            app.query=app.query.trim();
            if(app.query.trim().isEmpty())
                {
                System.err.println("Query is empty");
                return;
                }
            app.loadNames();
            app.run();
            }
        catch (Exception e)
            {
            e.printStackTrace();
            }
        }
    }


That's it

Pierre