PubMed: extracting the gender and location of the first authors who published in the Bioinformatics journal.
In this post I'll get some statistics from PubMed about the first authors of the articles published in the journal "Bioinformatics". I'll extract their genders and locations.
I'll use some tools I described a few years ago and have since rewritten.
Downloading the data
To download the papers published in Bioinformatics, the PubMed/Entrez query is '"Bioinformatics"[jour]'. I use pubmeddump to download all those articles as XML from PubMed.
java -jar jvarkit/dist/pubmeddump.jar '"Bioinformatics"[jour]'
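To show what such a query looks like on the wire, here is a minimal sketch that builds an NCBI E-utilities 'esearch' URL for the same term. It is only an illustration and not pubmeddump's actual code: the helper name 'buildEsearchUrl' and the 'retmax' value of 100 are assumptions.

```javascript
// Hypothetical helper: builds an NCBI E-utilities esearch URL for a PubMed query.
// Illustration only; this is not how pubmeddump is implemented.
function buildEsearchUrl(term, retmax) {
  var base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
  return base +
    "?db=pubmed" +
    "&term=" + encodeURIComponent(term) + // quotes and brackets must be escaped
    "&retmax=" + retmax;                  // maximum number of ids returned
}

var url = buildEsearchUrl('"Bioinformatics"[jour]', 100);
console.log(url);
```

The only subtle part is the escaping: the double quotes and the square brackets of the Entrez query have to be percent-encoded before they go into the URL.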
Adding the authors' gender
PubmedGender adds one or both of the attributes '@male' and '@female' to the PubMed XML '<Author>' element.
<Author ValidYN="Y" male="169">
<LastName>Lindenbaum</LastName>
<ForeName>Pierre</ForeName>
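As a rough illustration of the idea (not PubmedGender's actual code), a forename can be looked up in a table built from the SSA names file, whose lines have the form 'name,sex,count'. The function names 'loadNames' and 'genderCounts' and the input lines below are assumptions made for this sketch.

```javascript
// Sketch: count how often a forename was registered as male or female.
// Assumes the SSA "name,sex,count" line format (e.g. yob2015.txt).
// loadNames and genderCounts are hypothetical names, not part of jvarkit.
function loadNames(lines) {
  var db = Object.create(null); // no inherited keys such as "constructor"
  lines.forEach(function (line) {
    var tokens = line.trim().split(",");
    if (tokens.length !== 3) return; // skip blank or malformed lines
    var name = tokens[0].toLowerCase();
    var sex = tokens[1] === "M" ? "male" : "female";
    if (!db[name]) db[name] = { male: 0, female: 0 };
    db[name][sex] += parseInt(tokens[2], 10);
  });
  return db;
}

// Returns {male: n, female: m}, or null when the forename is unknown.
function genderCounts(db, foreName) {
  return db[foreName.toLowerCase()] || null;
}

// Toy input lines; the counts are illustrative, not real SSA figures.
var db = loadNames(["Pierre,M,169", "Robin,F,300", "Robin,M,120"]);
console.log(genderCounts(db, "Pierre"));
console.log(genderCounts(db, "Robin"));
```

A name such as "Robin" appears under both sexes, which is why the '<Author>' element may carry both attributes at once.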
Adding the authors' location
PubmedMap adds some attributes to the PubMed XML '<Affiliation>' element.
<Author>
<LastName>Lai</LastName>
<ForeName>Chih-Cheng</ForeName>
<Initials>CC</Initials>
<AffiliationInfo>
<Affiliation domain="tw" place="Taiwan">Department of Intensive Care Medicine, Chi Mei Medical Center, Liouying, Tainan, Taiwan.</Affiliation>
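To give an idea of what such a mapping involves, here is a hypothetical sketch that guesses the 'domain' and 'place' values from the affiliation text by looking for a country name. It is not PubmedMap's real rule set: the 'COUNTRIES' table is a tiny sample and 'guessLocation' is a name invented for this example.

```javascript
// Hypothetical sketch: guess a country from an affiliation string.
// COUNTRIES is a tiny sample table; PubmedMap's real rules may differ.
var COUNTRIES = [
  { pattern: /\bTaiwan\b/i, domain: "tw", place: "Taiwan" },
  { pattern: /\b(USA|United States)\b/i, domain: "us", place: "USA" },
  { pattern: /\bFrance\b/i, domain: "fr", place: "France" }
];

// Returns {domain, place} for the first matching country, or null.
function guessLocation(affiliation) {
  for (var i = 0; i < COUNTRIES.length; i++) {
    if (COUNTRIES[i].pattern.test(affiliation)) {
      return { domain: COUNTRIES[i].domain, place: COUNTRIES[i].place };
    }
  }
  return null; // no known country name found
}

var loc = guessLocation(
  "Department of Intensive Care Medicine, Chi Mei Medical Center, Liouying, Tainan, Taiwan."
);
console.log(loc);
```

Free-text affiliations are messy, so a real tool needs a much larger table and probably other hints as well; this sketch only shows the shape of the lookup.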
Extracting the data from XML as a table
I use SAXScript to extract the data from the XML. A SAX parser is an event-driven parser for XML. Here the events are handled by a simple JavaScript program.
The script below finds the sex, the year of publication and the location of the first author of each article and prints the results as a text table.
/** current text content */
var content=null;
/** author position in the article */
var count_authors=0;
/** current author */
var author=null;
/** in element <PubDate> */
var in_pubdate=false;
/** current year */
var year=null;

/** called when a new element XML is found */
function startElement(uri,localName,name,atts)
    {
    if(name=="PubDate")
        {
        in_pubdate=true;
        }
    else if(in_pubdate && name=="Year")
        {
        content="";
        }
    else if(name=="Author" && count_authors==0)
        {
        content="";
        /** get sex */
        var male = atts.getValue("male");
        var female = atts.getValue("female");
        var gender = (male==null?(female==null?null:"F"):"M");
        /* both male & female ? get the highest score */
        if(male!=null && female!=null)
            {
            var fm= parseInt(male);
            var ff= parseInt(female);
            gender= (fm>ff?"M":"F");
            }
        if(gender!=null) author={"sex":gender,"year":year,"domain":null};
        }
    else if(author!=null && name=="Affiliation")
        {
        author.domain = atts.getValue("domain");
        }
    }

/** in text node, append the text */
function characters(s)
    {
    if(content!=null) content+=s;
    }

/** end of XML element */
function endElement(uri,localName,name)
    {
    if(name=="PubDate")
        {
        in_pubdate=false;
        }
    else if(in_pubdate && name=="Year")
        {
        year=content;
        }
    else if(name=="PubmedArticle" || name=="PubmedBookArticle")
        {
        count_authors=0;
        author=null;
        year=null;
        in_pubdate=false;
        }
    else if(name=="Author")
        {
        count_authors++;
        /* print first author */
        if(author!=null)
            {
            print(author.sex+"\t"+author.year+"\t"+author.domain);
            author=null;
            }
        }
    content=null;
    }
All in one
# download database of names
wget -O names.zip "https://www.ssa.gov/oact/babynames/names.zip"
unzip -p names.zip yob2015.txt > names.csv
rm names.zip

java -jar jvarkit/dist/pubmeddump.jar '"Bioinformatics"[jour]' |\
    java -jar jvarkit/dist/pubmedgender.jar -d names.csv |\
    java -jar jvarkit/dist/pubmedmap.jar |\
    java -jar src/jsandbox/dist/saxscript.jar -f pubmed.js > data.csv
The output (count, sex, year, country):
$ cat data.csv | sort | uniq -c | sort -n
(...)
    105 M 2015 us
    107 M 2004 us
    107 M 2013 us
    115 M 2008 us
    117 M 2011 us
    120 M 2009 us
    122 M 2010 us
    126 M 2014 us
    130 M 2012 us
    139 M 2005 us
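The same table can be summarised further. As a minimal sketch, assuming the tab-separated 'sex, year, domain' rows printed by the script above, the share of female first authors per year could be computed as follows. The function name 'femaleShareByYear' and the input rows are made up for the example and are not real results.

```javascript
// Sketch: share of female first authors per year.
// Input: tab-separated "sex<TAB>year<TAB>domain" rows, as in data.csv.
// femaleShareByYear is a hypothetical name used for this example.
function femaleShareByYear(lines) {
  var byYear = {};
  lines.forEach(function (line) {
    var tokens = line.split("\t");
    if (tokens.length < 3) return; // skip blank or malformed rows
    var sex = tokens[0];
    var year = tokens[1];
    if (!byYear[year]) byYear[year] = { F: 0, total: 0 };
    byYear[year].total++;
    if (sex === "F") byYear[year].F++;
  });
  var shares = {};
  Object.keys(byYear).forEach(function (year) {
    shares[year] = byYear[year].F / byYear[year].total;
  });
  return shares;
}

// Toy rows for demonstration only; these are not real results.
var rows = ["M\t2015\tus", "F\t2015\tfr", "M\t2015\tde", "F\t2015\tus", "M\t2014\tus"];
var shares = femaleShareByYear(rows);
console.log(shares);
```

Keep in mind that authors whose forename could not be assigned a gender never reach the table, so any such share is computed only over the first authors with a known gender.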
That's it, Pierre