Showing posts with label journal. Show all posts

27 May 2016

pubmed: extracting the gender and location of the 1st authors who published in the Bioinformatics journal.

In this post I'll get some statistics about the 1st authors in the "Bioinformatics" journal from pubmed. I'll extract their genders and locations.
I'll use some tools I described some years ago and have since rewritten.

Downloading the data

To download the papers published in Bioinformatics, the pubmed/entrez query is '"Bioinformatics"[jour]'.
I use pubmeddump to download all those articles as XML from pubmed.
java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]'

Adding the authors' gender

PubmedGender is used to add the attributes '@male' and/or '@female' to the Pubmed/XML '<Author>' element.
<Author ValidYN="Y" male="169">
  <LastName>Lindenbaum</LastName>
  <ForeName>Pierre</ForeName>

Adding the authors' location

PubmedMap is used to add some attributes to the Pubmed/XML '<Affiliation>' element.
<Author>
 <LastName>Lai</LastName>
 <ForeName>Chih-Cheng</ForeName>
 <Initials>CC</Initials>
 <AffiliationInfo>
  <Affiliation domain="tw" place="Taiwan">Department of Intensive Care Medicine, Chi Mei Medical Center, Liouying, Tainan, Taiwan.</Affiliation>

Extracting the data from XML as a table

I use SAXScript to extract the data from XML.
A SAX parser is an event-driven parser for XML. Here the events are handled by a simple javascript program.
The script below finds the sex, the year of publication and the location of the 1st author of each article, and prints the results as a text table.
/** current text content */
var content=null;
/** author position in the article */
var count_authors=0;
/** current author */
var author=null;
/** in element <PubDate> */
var in_pubdate=false;
/** current year */
var year=null;

/** called when a new XML element is found */
function startElement(uri,localName,name,atts)
        {
        if(name=="PubDate")
                { in_pubdate=true;}
        else if(in_pubdate && name=="Year")
                { content="";}
        else if(name=="Author" && count_authors==0) {
                content="";
                /** get sex */
                var male = atts.getValue("male");
                var female = atts.getValue("female");
                var gender = (male==null?(female==null?null:"F"):"M");
                /* both male & female ? get the highest score */
                if(male!=null && female!=null)
                        {
                        var fm= parseInt(male);
                        var ff= parseInt(female);
                        gender= (fm>ff?"M":"F");
                        }
                if(gender!=null) author={"sex":gender,"year":year,"domain":null};
                }
        else if(author!=null && name=="Affiliation") {
                author.domain = atts.getValue("domain");
                }
        }

/** in text node, append the text  */
function characters(s)
        {
        if(content!=null) content+=s;
        }

/** end of XML element */
function endElement(uri,localName,name)
        {
        if(name=="PubDate") { in_pubdate=false;}
        else if(in_pubdate && name=="Year") { year=content;}
        else if(name=="PubmedArticle" || name=="PubmedBookArticle")
                {
                count_authors=0;
                author=null;
                year=null;
                in_pubdate=false;
                }
        else if(name=="Author") {
                count_authors++;
                /* print first author */
                if(author!=null) {
                        print(author.sex+"\t"+author.year+"\t"+author.domain);
                        author=null;
                        }
                }

        content=null;
        }
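
The male/female decision made in startElement can be isolated as a small pure function, which makes it easy to check on its own. This is only a sketch: the name pickGender is hypothetical, and it uses console.log (plain JavaScript, e.g. Node.js) rather than the print function of the script above.

```javascript
// Hypothetical helper mirroring the decision made in startElement().
// `male` and `female` are the attribute values (strings), or null when absent.
function pickGender(male, female) {
    if (male == null && female == null) return null; // name not found
    if (male == null) return "F";
    if (female == null) return "M";
    // both attributes present: keep the highest score.
    // A tie goes to "F", exactly as (fm>ff?"M":"F") does in the script.
    return parseInt(male, 10) > parseInt(female, 10) ? "M" : "F";
}

console.log(pickGender("169", null));  // M
console.log(pickGender("12", "340"));  // F
console.log(pickGender(null, null));   // null
```

Authors for whom the function returns null are simply not printed by the script, so they are missing from the final table.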

All in one

#download database of names
wget -O names.zip "https://www.ssa.gov/oact/babynames/names.zip" 
unzip -p names.zip yob2015.txt > names.csv
rm names.zip

java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]' |\
 java -jar jvarkit/dist/pubmedgender.jar  -d names.csv |\
 java -jar jvarkit/dist/pubmedmap.jar  |\
 java -jar src/jsandbox/dist/saxscript.jar -f pubmed.js > data.csv

The output (count, sex, year, country):
$ cat data.csv  | sort | uniq -c | sort -n
(...)
    105 M 2015 us
    107 M 2004 us
    107 M 2013 us
    115 M 2008 us
    117 M 2011 us
    120 M 2009 us
    122 M 2010 us
    126 M 2014 us
    130 M 2012 us
    139 M 2005 us
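
The same table can be summarised further, for example as the share of female 1st authors per year. The sketch below assumes the three tab-separated columns printed by the script (sex, year, domain); the function name femaleShareByYear and the sample rows are made up for illustration and are not real results.

```javascript
// Hypothetical helper: summarise lines of "sex<TAB>year<TAB>domain".
// Returns an object mapping each year to the fraction of "F" rows.
function femaleShareByYear(text) {
    const counts = new Map(); // year -> {F: n, M: n}
    for (const line of text.split("\n")) {
        if (line.trim() === "") continue;      // skip empty lines
        const [sex, year] = line.split("\t");
        if (sex !== "F" && sex !== "M") continue; // ignore malformed rows
        if (!counts.has(year)) counts.set(year, { F: 0, M: 0 });
        counts.get(year)[sex]++;
    }
    const shares = {};
    for (const [year, c] of counts) {
        shares[year] = c.F / (c.F + c.M);
    }
    return shares;
}

// made-up sample rows, NOT taken from data.csv
const sample = "M\t2015\tus\nF\t2015\tfr\nM\t2015\tuk\nM\t2014\tus\n";
console.log(femaleShareByYear(sample));
```

With Node.js, the real file could be passed as femaleShareByYear(require("fs").readFileSync("data.csv", "utf8")).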

That's it, Pierre

14 October 2012

Calculating time from submission to publication / Degree of burden in submitting a paper

After "404 not found": a database of non-functional resources in the NAR database collection, I've uploaded my second dataset on figshare: Calculating time from submission to publication / Degree of burden in submitting a paper.

Calculating time from submission to publication / Degree of burden in submitting a paper. Pierre Lindenbaum,  Ryan Delahanty.
figshare.
Retrieved 10:13, Oct 14, 2012 (GMT)
http://dx.doi.org/10.6084/m9.figshare.96403

This dataset was inspired by this post on biostar, initially asked by Ryan Delahanty: I was wondering if it would be possible to calculate some kind of a metric for the speed-of-publication for each journal. I'm not sure submitted and accepted dates are available for all papers, but I noticed in XML data there are fields like the following:
<PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2011</Year>
                <Month>11</Month>
                <Day>29</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2011</Year>
                <Month>12</Month>
                <Day>20</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
           (...)

In this dataset, the script 'pubmed.sh' downloads the journals from http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.pubmedhelptable45/ and the 'eigenfactors' from http://www.eigenfactor.org.

For each journal, it scans pubmed (starting from year=2000) and gets the difference between the date[@PubStatus='received'] and the date[@PubStatus='accepted'].

title	issn	eigenfactor	days
"Acta biochimica Polonica"0001-527X0.003996119.770935960591
"Acta biomaterialia"1742-70610.02152129.682692307692
"Acta biotheoretica"0001-53420.000844161.897058823529
"Acta cirurgica brasileira / Sociedade Brasileira para Desenvolvimento Pesquisa em Cirurgia"0102-86500.00128122.038461538462
"Acta cytologica"0001-55470.00230565.3006134969325
"Acta diabetologica"0940-54290.001851299.6
"Acta haematologica"0001-57920.002825118.654676258993
"Acta histochemica"0065-12810.002162110.471204188482
"Acta histochemica et cytochemica"0044-59910.00067781.6455696202532
"Acta neurochirurgica"0001-62680.009685204.371830985916
"Acta neuropathologica"0001-63220.02347169.7277882797732
"Acta theriologica"0001-70510.000901147.0
"Acta tropica"0001-706X0.01011196.577777777778
"Acta veterinaria Scandinavica"0044-605X0.00161282.0
"Addictive behaviors"0306-46030.017915163.049731182796
"Advances in space research "0273-11770.021217205.0
Ambio0044-74470.007463181.878048780488
"American journal of human genetics"0002-92970.12015667.1898928024502
"American journal of hypertension"0895-70610.017359104.074576271186
(....)

Here is the kind of figure I got:

As far as I remember, "Cell" is the point having the highest eigenfactor.


Note: pubmed contains some errors, e.g. received > accepted (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20591334&retmode=xml) or some dates in the future (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12921703&retmode=xml).
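
The interval between the 'received' and 'accepted' dates can be sketched as follows. This is not the code of 'pubmed.sh': the function names and the {year, month, day} record shape are assumptions, and returning null for the two kinds of errors mentioned in the note is one possible way to handle them, not necessarily what the dataset did.

```javascript
// Days between two PubMedPubDate-like records {year, month, day}.
// UTC is used so that daylight saving changes cannot shift the result.
function daysBetween(received, accepted) {
    const MS_PER_DAY = 24 * 60 * 60 * 1000;
    const t0 = Date.UTC(received.year, received.month - 1, received.day);
    const t1 = Date.UTC(accepted.year, accepted.month - 1, accepted.day);
    return Math.round((t1 - t0) / MS_PER_DAY);
}

// Assumption: suspicious records are dropped (null) when the paper was
// accepted before it was received, or accepted after `now`.
function reviewDays(received, accepted, now) {
    const acceptedTime = Date.UTC(accepted.year, accepted.month - 1, accepted.day);
    const d = daysBetween(received, accepted);
    if (d < 0 || acceptedTime > now.getTime()) return null;
    return d;
}

// the dates of the XML fragment shown above
const received = { year: 2011, month: 11, day: 29 };
const accepted = { year: 2011, month: 12, day: 20 };
console.log(reviewDays(received, accepted, new Date())); // 21
```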


That's it,

Pierre

13 December 2010

A new journal: BMC Open Research Computation #OpenResComp


Citing ''Aims & scope'': Open Research Computation publishes peer reviewed articles that describe the development, capacities, and uses of software designed for use by researchers in any field.

Submissions relating to software for use in any area of research are welcome as are articles dealing with algorithms, useful code snippets, as well as large applications or web services, and libraries.

Open Research Computation differs from other journals with a software focus in its requirement for the software source code to be made available under an Open Source Initiative compliant license, and in its assessment of the quality of documentation and testing of the software.

In addition to articles describing software Open Research Computation also welcomes submissions that review or describe developments relating to software based tools for research. These include, but are not limited to, reviews or proposals for standards, discussion of best practice in research software development, educational and support resources and tools for researchers that develop or use software based tools.


See also the insights from Cameron Neylon, Jan Aerts, Neil 10K Saunders ...

21 July 2008

SciFOAF 2.0

If you're following me on twitter or on friendfeed you may know that I've written a new version of SciFOAF.

Here is the documentation:



What is SciFOAF


SciFOAF is the second version of a tool I created to build a FOAF/RDF file from your publications in ncbi/pubmed. The FOAF project defines a semantic format based on RDF/XML to define persons or groups, their relationships, as well as their basic properties such as name, e-mail address, subjects of interest, publications, and so on... This FOAF profile can be used to describe your work, your laboratory, your contacts.
The first version was introduced in 2006 here as a java webstart interface and had many problems:

  • the RDF file could not be loaded/saved

  • only a few properties could be edited

  • authors' names may vary from one journal to another: some journals use the initials of an author while others use the complete first name.

  • the interaction was just a kind of multiple-choice questionnaire


The new version now uses the Jena API, and the RDF repository can be loaded and saved.

Requirements



Downloading SciFOAF


A *.jar file should be available for download at http://lindenb.googlecode.com/files/scifoaf.jar.

Running SciFOAF


Set up the CLASSPATH
export JENA_LIB=your_path_to/Jena/lib
export CLASSPATH=${JENA_LIB}/antlr-2.7.5.jar:${JENA_LIB}/arq-extra.jar:${JENA_LIB}/arq.jar:${JENA_LIB}/commons-logging-1.1.1.jar:${JENA_LIB}/concurrent.jar:${JENA_LIB}/icu4j_3_4.jar:${JENA_LIB}/iri.jar:${JENA_LIB}/jena.jar:${JENA_LIB}/jenatest.jar:${JENA_LIB}/json.jar:${JENA_LIB}/junit.jar:${JENA_LIB}/log4j-1.2.12.jar:${JENA_LIB}/lucene-core-2.3.1.jar:${JENA_LIB}/stax-api-1.0.jar:${JENA_LIB}/wstx-asl-3.0.0.jar:${JENA_LIB}/xercesImpl.jar:${JENA_LIB}/xml-apis.jar:YOUR_PATH_TO/scifoaf.jar

Run SciFOAF
java org.lindenb.scifoaf.SciFOAF

The first time you run SciFOAF, you're prompted to give yourself a URI. The best choice would be the URL where your foaf file will be stored, or the URL of your personal homepage or blog. On startup a file called foaf.rdf will be created in your home directory. Alternatively you can specify a file on the command line.
When the application is closed, the FOAF model will be saved back to the file.

The Main Pane


The first window contains a sequence of tabs. Each tab corresponds to a given RDF class:

  • foaf:Person

  • geo:Place

  • bibo:Article

  • ...


For each tab, a button "New ...." creates a new instance of the given Class.

Building your profile


Add a foaf:Image


Add the URL of the picture, for example: http://upload.wikimedia.org/wikipedia/commons/4/42/Charles_Darwin_aged_51.jpg.

Add a bibo:Article


Enter the PMID of the article.

Add a geo:Place


SciFOAF uses the geonames.org API.

Add a foaf:Person


You can then link this person to his publications, his foaf:based_near, the persons he knows...

Etc...


Create foaf:Group, event:Event, doap:Project....

Exporting to KML


(Experimental) In the "File" menu, select "Export to KML". SciFOAF will export a KML file containing the geolocalized foaf:Persons.
A test is available here and is visible in maps.google.com at http://maps.google.com/maps?q=http://yokofakun.....
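
To give an idea of what such an export contains, here is a minimal sketch that writes one KML Placemark per person. It is not SciFOAF's code (SciFOAF is written in Java): the function names, the {name, lat, lon} record shape and the sample coordinates are assumptions made for illustration.

```javascript
// Escape the characters that are special in XML text.
function escapeXml(s) {
    return String(s)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// persons: array of {name, lat, lon} (hypothetical shape)
function toKml(persons) {
    const placemarks = persons.map(p =>
        "  <Placemark>\n" +
        "    <name>" + escapeXml(p.name) + "</name>\n" +
        // KML writes coordinates as longitude,latitude
        "    <Point><coordinates>" + p.lon + "," + p.lat + "</coordinates></Point>\n" +
        "  </Placemark>\n"
    ).join("");
    return '<?xml version="1.0" encoding="UTF-8"?>\n' +
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n<Document>\n' +
        placemarks +
        "</Document>\n</kml>\n";
}

// approximate coordinates, for illustration only
console.log(toKml([{ name: "Charles Darwin", lat: 51.33, lon: 0.05 }]));
```

The longitude comes first in a KML coordinates element, which is an easy thing to get wrong when the data is stored as latitude/longitude.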

Exporting to XHTML+SVG


(Experimental) In the "File" menu, select "Export to XHTML". Here, I've roughly copied the tool I wrote for exploring the Nature Network using SVG/javascript/JSON/XHTML. Many things remain to be done.

Loading a Batch of Articles


In the main panel, for bibo:Article a button can be used to load a batch of articles.
On ncbi/pubmed, perform a query, choose
Display: and then . Copy the list of PMID and paste it in the "Load Batch" dialog, press OK. After a moment, all the articles are uploaded in the RDF model.

Example


An RDF file describing a few persons in the Biogang is available here.

Source Code


The source code is available on http://code.google.com/p/lindenb/.
The ant file is in lindenb/proj/scifoaf/build.xml.


Pierre

26 October 2007

“Getting Started In…”

In PLOS, today: This month, PLoS Computational Biology and the ISCB begin a series of short, practical articles for students and active researchers who want to learn more about new areas of computational biology and are unsure where or how to start. The aim of each article in the “Getting Started in…” series is to introduce the essentials: define the area and what it is about, highlight the debates and issues of relevance, and provide directions to the most relevant books, articles, or Web sites to find out more...

The first expert to inform, motivate, and inspire readers to consider a new direction is Dr. Xiaole Shirley Liu, who introduces tiling microarrays.

“Getting Started In…”: A Series Not to Miss: PLoS Computational Biology 3 (10), e224 (2007)
Getting Started in Tiling Microarray Analysis : PLoS Computational Biology 3 (10), e183 (2007)

13 July 2007

NAR, Web Server issue July 2007



The annual "Web Server Issue" of "Nucleic Acids Research" is available at: http://nar.oxfordjournals.org/content/vol35/suppl_2/index.dtl?etoc. This issue reports on 130 web servers.

Pierre

26 April 2007

New Journal: Human Frontier Science Program Journal


The HFSP Journal aims to publish high quality, innovative interdisciplinary basic research at the frontier of biology over a wide range of organizational levels (from the molecular level to population biology) using principles strategies or technologies from the more quantitative disciplines (e.g. physics, chemistry, mathematics, engineering, or informatics).

http://hfspj.aip.org/

22 January 2007

Launch of BMC Systems Biology


BMC Systems Biology, the first open access journal focussed solely on the entire emerging subject of systems biology, has just published its first articles.