and so it begins ! :-D @ORCID_Org Orcid Identififiers in @ncbi_pubmed https://t.co/hEBPQOoYjH
— Pierre Lindenbaum (@yokofakun) May 19, 2016
YESSSS !!!!!!!!!!!!! pic.twitter.com/B0fWNU8V2A
There are, however, several minor problems. I found some articles where the ORCID iD is malformed or where different people share the same ORCID iD:
The dream is over: two authors sharing the same @ORCID_Org Orcid ID in pubmed https://t.co/NdSW87fV3Y pic.twitter.com/p7EnTEl8Mc
— Pierre Lindenbaum (@yokofakun) May 20, 2016
for now, I've found 45 papers in pubmed having a problem with their @ORCID_Org ID : https://t.co/4H20PLtvJe
— Pierre Lindenbaum (@yokofakun) May 20, 2016
"- I suggest we all use the same @ORCID_Org in the lab
- sounds legit" pic.twitter.com/m5yvi60DRL
— Pierre Lindenbaum (@yokofakun) May 20, 2016
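A malformed identifier can often be caught mechanically. The sketch below is not how the problematic papers above were found; it only assumes the published ORCID iD layout (four groups of four characters, the last character being an ISO 7064 MOD 11-2 check character that may be X). The function name is_valid_orcid is hypothetical.

```shell
# Hypothetical helper: exit status 0 when "$1" is a well-formed ORCID iD.
# Note: plain POSIX sh, so the variables below are global.
is_valid_orcid() {
  id=$1
  # 1) shape check: dddd-dddd-dddd-ddd followed by a digit or X
  case "$id" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9X]) ;;
    *) return 1 ;;
  esac
  # 2) ISO 7064 MOD 11-2 checksum over the first 15 digits
  digits=$(printf '%s' "$id" | tr -d '-')
  total=0
  i=1
  while [ "$i" -le 15 ]; do
    d=$(printf '%s' "$digits" | cut -c "$i")
    total=$(( (total + d) * 2 ))
    i=$((i + 1))
  done
  check=$(( (12 - total % 11) % 11 ))
  # a computed value of 10 is written as the letter X
  [ "$check" -eq 10 ] && check=X
  # 3) compare with the 16th character of the identifier
  [ "$check" = "$(printf '%s' "$digits" | cut -c 16)" ]
}
```

For example, `is_valid_orcid 0000-0002-4078-7413 && echo ok` prints ok, while changing the last digit makes the function fail. A check like this only catches malformed identifiers; it cannot tell when two different authors share one valid iD.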
You can download the papers containing ORCID identifiers using the Entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
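The same query can presumably be sent to the NCBI E-utilities esearch endpoint from the command line. The sketch below only builds the URL; the helper name esearch_url and the default retmax of 20 are illustrative assumptions, and the percent-encoding covers just the characters that appear in this kind of query.

```shell
# Hypothetical helper: print an E-utilities esearch URL for a PubMed term.
#   $1 = search term, e.g. 'orcid[AUID]'
#   $2 = maximum number of PMIDs to return (assumed default: 20)
esearch_url() {
  # minimal percent-encoding: '%' first, then space and square brackets
  term=$(printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/\[/%5B/g' -e 's/]/%5D/g')
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=%s&term=%s\n' "${2:-20}" "$term"
}
```

Something like `curl -s "$(esearch_url 'orcid[AUID]' 5)"` should then return an XML document listing the matching PMIDs.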
I used one of my tools, pubmeddump, to download the articles as XML, and I wrote PubmedOrcidGraph to extract the authors' ORCID iDs.
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
    (...)

Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statements.
<?xml version="1.0"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
    >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) &lt; 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>
This stylesheet uses an extension function, 'strcmp', for the XSLT processor xalan to compare two strings (in XSLT 1.0 the '<' operator only compares numbers).
The extension is only there to make sure that the field "orcid1" in the table "collab" is always lower than "orcid2", which avoids duplicate pairs.
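The same normalisation can be sketched outside XSLT. The snippet below is only an illustration of the idea, not part of the pipeline: ordered_pair is a hypothetical helper that prints two identifiers with the lower one first, so that (A,B) and (B,A) end up as the same row.

```shell
# Hypothetical helper: print "$1" and "$2" separated by a space,
# lower string first, so each co-author pair has one canonical form.
ordered_pair() {
  # appending "" forces awk to compare the values as strings, not numbers
  awk -v a="$1" -v b="$2" 'BEGIN { if ((a "") < (b "")) print a, b; else print b, a }'
}
```

Calling `ordered_pair 0000-0002-6090-3100 0000-0002-4078-7413` and calling it with the arguments swapped print the same line, which is what lets the unique(orcid1,orcid2) constraint discard the duplicate.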
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)

and those SQL statements are loaded into sqlite3:
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite
The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite
# fields are separated by a tab (assumed here), because names and affiliations contain spaces
TAB=$(printf '\t')

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">
<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

# nodes: escape the XML special characters ('&' first), then print one <node> per author
sqlite3 -separator "${TAB}" -noheader ${DB} 'select orcid,name,affiliation from author' |\
 sed -e 's/&/&amp;/g' -e "s/</\&lt;/g" -e "s/>/\&gt;/g" -e "s/'/\&apos;/g" -e 's/"/\&quot;/g' |\
 awk -F '\t' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"

# edges: one <edge> per pair of co-authors
sqlite3 -separator "${TAB}" -noheader ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F '\t' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'

echo "</edges></graph></gexf>"
If you want to play, I've uploaded the gephi+pubmed graph as gexf/gephi here: https://t.co/0nRRts7gXm (4Mb) pic.twitter.com/8RGuI7X3ZE
— Pierre Lindenbaum (@yokofakun) May 21, 2016
The output is saved and then loaded into gephi.
playing with the ORCID/pubmed graph in gephi pic.twitter.com/1Ao5OC7ywI
— Pierre Lindenbaum (@yokofakun) May 21, 2016
where is my lab @institut_thorax in the pubmed/orcid graph of co-authorships ? pic.twitter.com/3Krqk5K1o8
— Pierre Lindenbaum (@yokofakun) May 21, 2016
That's it,
Pierre