YOKOFAKUN: May 2016

27 May 2016

pubmed: extracting the 1st authors' gender and location who published in the Bioinformatics journal.

In this post I'll get some statistics about the 1st authors in the "Bioinformatics" journal from pubmed. I'll extract their genders and locations.
I'll use some tools I've already described some years ago but I've re-written them.

Downloading the data

To download the paper published in Bioinformatics, the pubmed/entrez query is '"Bioinformatics"[jour]'.
I use pubmeddump to download all those articles as XML from pubmed .

java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]'

Adding the authors' gender

PubmedGender is used to add two attributes '@male' or/and '@female' to the Pubmed/XML '<Author>' element.

<Author ValidYN="Y" male="169">
  <LastName>Lindenbaum</LastName>
  <ForeName>Pierre</ForeName>

Adding the authors' location

PubmedMap is used to add some attributes to the Pubmed/XML '<Affiliation>' element.

<Author>
 <LastName>Lai</LastName>
 <ForeName>Chih-Cheng</ForeName>
 <Initials>CC</Initials>
 <AffiliationInfo>
  <Affiliation domain="tw" place="Taiwan">Department of Intensive Care Medicine, Chi Mei Medical Center, Liouying, Tainan, Taiwan.</Affiliation>

Extracting the data from XML as a table

I use SAXScript to extract the data from XML.
A SAX parser is event-driven parser for XML. Here the events are invoked using a simple javascript program.
The script below will find the sex , the year of publication and the location of each 1st author of each article and print the results as text table.

/** current text content */
var content=null;
/** author position in the article */
var count_authors=0;
/** current author */
var author=null;
/** in element <PubDate> */
var in_pubdate=false;
/** current year */
var year=null;

 /** called when a new element XML is found */
function startElement(uri,localName,name,atts)
    {
 if(name=="PubDate")
  { in_pubdate=true;}
 else if(in_pubdate && name=="Year")
  { content="";}
    else if(name=="Author" && count_authors==0) {
  content="";
  /** get sex */
  var male = atts.getValue("male");
  var female = atts.getValue("female");
  var gender = (male==null?(female==null?null:"F"):"M");
  /* both male & female ? get the highest score */
  if(male!=null && female!=null)
   {
   var fm= parseInt(male);
   var ff= parseInt(female);
   gender= (fm>ff?"M":"F");
   }
  if(gender!=null) author={"sex":gender,"year":year,"domain":null};
  }
    else if(author!=null && name=="Affiliation") {
  author.domain = atts.getValue("domain");
  }
        }

/** in text node, append the text  */
function characters(s)
        {
        if(content!=null) content+=s;
        }

/** end of XML element */
function endElement(uri,localName,name)
        {
        if(name=="PubDate") { in_pubdate=false;}
        else if(in_pubdate && name=="Year") { year=content;}
        else if(name=="PubmedArticle" || name=="PubmedBookArticle")
   {
   count_authors=0;
   author=null;
   year=null;
   in_pubdate=false;
   }
        else if(name=="Author") {
   count_authors++;
   /* print first author */
   if(author!=null) {
    print(author.sex+"\t"+author.year+"\t"+author.domain);
    author=null;
    }
   }

        content=null;
        }

All in one

#download database of names
wget -O names.zip "https://www.ssa.gov/oact/babynames/names.zip" 
unzip -p names.zip yob2015.txt > names.csv
rm names.zip

java -jar jvarkit/dist/pubmeddump.jar   '"Bioinformatics"[jour]' |\
 java -jar jvarkit/dist/pubmedgender.jar  -d names.csv |\
 java -jar jvarkit/dist/pubmedmap.jar  |\
 java -jar src/jsandbox/dist/saxscript.jar -f pubmed.js > data.csv

The output (count, sex , year , country ):

$ cat data.csv  | sort | uniq -c | sort -n
(...)
    105 M 2015 us
    107 M 2004 us
    107 M 2013 us
    115 M 2008 us
    117 M 2011 us
    120 M 2009 us
    122 M 2010 us
    126 M 2014 us
    130 M 2012 us
    139 M 2005 us

That's it, Pierre

21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "

I've recently discovered that pubmed now integrates ORCID identfiers.

and so it begins ! :-D @ORCID_Org Orcid Identififiers in @ncbi_pubmed https://t.co/hEBPQOoYjH
YESSSS !!!!!!!!!!!!! pic.twitter.com/B0fWNU8V2A
— Pierre Lindenbaum (@yokofakun) May 19, 2016

And there are several minor problems, I found some articles where the ORCID id is malformed or where different people use the same ORCID-ID:

The dream is over: two authors sharing the same @ORCID_Org Orcid ID in pubmed https://t.co/NdSW87fV3Y pic.twitter.com/p7EnTEl8Mc
— Pierre Lindenbaum (@yokofakun) May 20, 2016

for now, I've found 45 papers in pubmed having a problem with their @ORCID_Org ID : https://t.co/4H20PLtvJe
— Pierre Lindenbaum (@yokofakun) May 20, 2016

"- I suggest we all use the same @ORCID_Org in the lab
- sounds legit" pic.twitter.com/m5yvi60DRL
— Pierre Lindenbaum (@yokofakun) May 20, 2016

You can download the papers containing some orcid Identifiers using the entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I've used one of my tools pubmeddump to download the articles asXML and I wrote PubmedOrcidGraph to extract the author's orcid.

<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
(...)

Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into some SQL statement.

<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) < 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This stylesheet contains an extension 'strmcp' for the xslt processor xalan to compare two XML strings
This extension is just used to always be sure that the field "orcid1" in the table "collab" is always lower than "orcid2" to avoid duplicates pairs.

./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)

and those sql statetements are loaded into sqlite3:

./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.

DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

sqlite3 -separator ' ' -noheader  ${DB} 'select orcid,name,affiliation from author' |\
 sed  -e 's/&/&/g' -e "s/</\</g" -e "s/>/\>/g" -e "s/'/\'/g"  -e 's/"/\"/g' |\
 awk -F ' ' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator ' ' -noheader  ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F ' ' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"

If you want to play, I've uploaded the gephi+pubmed graph as gexf/gephi here: https://t.co/0nRRts7gXm (4Mb) pic.twitter.com/8RGuI7X3ZE
— Pierre Lindenbaum (@yokofakun) May 21, 2016

The output is saved and then loaded into gephi.

playing with the ORCID/pubmed graph in gephi pic.twitter.com/1Ao5OC7ywI
— Pierre Lindenbaum (@yokofakun) May 21, 2016

where is my lab @institut_thorax in the pubmed/orcid graph of co-authorships ? pic.twitter.com/3Krqk5K1o8
— Pierre Lindenbaum (@yokofakun) May 21, 2016

That's it,

Pierre

17 May 2016

finding new intron-exon junctions using the public Encode RNASeq data

I've been asked to look for some new / suspected / previously uncharacterized intron-exon junctions in public RNASeq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.

The following command is used to build the list of BAMs:

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' 

wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \
(...)

This list is inserted as a list named SAMPLES a Makefile.

For each BAM, we use samtools to retrieve the reads in the region(s)) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to only keep the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read, computes the genomic interval skipped when the cigar operator is a deletion or a skipped region/intron. The read is printed if it describes the new intron-exon junction.

All in one:

That's it,

Pierre

YOKOFAKUN