Showing posts with label network. Show all posts

21 May 2016

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook.

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized. "
I've recently discovered that PubMed now integrates ORCID identifiers.

There are a few minor problems: I found some articles where the ORCID iD is malformed, and others where different people use the same ORCID iD.
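Malformed identifiers can be caught mechanically, because the last character of an ORCID iD is a check digit (ISO 7064 MOD 11-2). The following is a minimal sketch and not part of PubmedOrcidGraph: the class name `OrcidCheck` and its methods are made up for this illustration. It only detects malformed identifiers; it cannot tell whether two different people share the same valid iD.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks the shape and the check digit of an ORCID iD. */
final class OrcidCheck {
    // Four dash-separated groups of four characters; only the last one may be 'X'.
    private static final Pattern SHAPE =
        Pattern.compile("\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]");

    /** Computes the ISO 7064 MOD 11-2 check character from the first 15 digits. */
    static char checkDigit(String first15Digits) {
        int total = 0;
        for (int i = 0; i < first15Digits.length(); i++) {
            total = (total + Character.digit(first15Digits.charAt(i), 10)) * 2;
        }
        int result = (12 - total % 11) % 11;
        return result == 10 ? 'X' : (char) ('0' + result);
    }

    /** Returns true if the identifier is well-formed and its check digit matches. */
    static boolean isValid(String orcid) {
        if (orcid == null || !SHAPE.matcher(orcid).matches()) {
            return false;
        }
        String digits = orcid.replace("-", "");
        return checkDigit(digits.substring(0, 15)) == digits.charAt(15);
    }
}
```

For example, `OrcidCheck.isValid("0000-0002-4078-7413")` accepts the first identifier of the XML extract shown in this post, while the same string with a different last digit is rejected.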







You can download the papers containing ORCID identifiers using the Entrez query http://www.ncbi.nlm.nih.gov/pubmed/?term=orcid[AUID].
I used one of my tools, pubmeddump, to download the articles as XML, and I wrote PubmedOrcidGraph to extract the authors' ORCID iDs.
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <!--Generated with PubmedOrcidGraph https://github.com/lindenb/jvarkit/wiki/PubmedOrcidGraph - Pierre Lindenbaum.-->
  <PubmedArticle pmid="27197243" doi="10.1101/gr.199760.115">
    <year>2016</year>
    <journal>Genome Res.</journal>
    <title>Improved definition of the mouse transcriptome via targeted RNA sequencing.</title>
    <Author orcid="0000-0002-4078-7413">
      <foreName>Giovanni</foreName>
      <lastName>Bussotti</lastName>
      <initials>G</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-4449-1863">
      <foreName>Tommaso</foreName>
      <lastName>Leonardi</lastName>
      <initials>T</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
    <Author orcid="0000-0002-6090-3100">
      <foreName>Anton J</foreName>
      <lastName>Enright</lastName>
      <initials>AJ</initials>
      <affiliation>EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;</affiliation>
    </Author>
  </PubmedArticle>
  <PubmedArticle pmid="27197225" doi="10.1101/gr.204479.116">
    <year>2016</year>
    <journal>Genome Res.</journal>
(...)
Now, I want to insert those data into a sqlite3 database. I use the XSLT stylesheet below to convert the XML into SQL statements.
<?xml version="1.0"?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
    xmlns:xalan="http://xml.apache.org/xalan"
    xmlns:str="xalan://com.github.lindenb.xslt.strings.Strings"
    exclude-result-prefixes="xalan str"
 >
<xsl:output method="text"/>
<xsl:variable name="q">'</xsl:variable>

<xsl:template match="/">
create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
<xsl:apply-templates select="PubmedArticleSet/PubmedArticle"/>
commit;
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:for-each select="Author">
<xsl:variable name="o1" select="@orcid"/>insert or ignore into author(orcid,name,affiliation) values ('<xsl:value-of select="$o1"/>','<xsl:value-of select="translate(concat(lastName,' ',foreName),$q,' ')"/>','<xsl:value-of select="translate(affiliation,$q,' ')"/>');
<xsl:for-each select="following-sibling::Author">insert or ignore into collab(orcid1,orcid2) values(<xsl:variable name="o2" select="@orcid"/>
<xsl:choose>
 <xsl:when test="str:strcmp( $o1 , $o2) &lt; 0">'<xsl:value-of select='$o1'/>','<xsl:value-of select='$o2'/>'</xsl:when>
 <xsl:otherwise>'<xsl:value-of select='$o2'/>','<xsl:value-of select='$o1'/>'</xsl:otherwise>
</xsl:choose>);
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

This stylesheet uses an extension function, 'strcmp', for the XSLT processor xalan to compare two strings.
The extension is only there to make sure that the field "orcid1" in the table "collab" always sorts before "orcid2", which avoids duplicate pairs.
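The same ordering trick can be sketched in plain Java. This is only an illustration of the idea, not the code of the xalan extension: the class `CollabPairs` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: stores each co-author pair once, whatever the author order. */
final class CollabPairs {
    /** Returns "smaller<TAB>larger", so that (a,b) and (b,a) give the same key. */
    static String key(String orcid1, String orcid2) {
        return orcid1.compareTo(orcid2) < 0
            ? orcid1 + "\t" + orcid2
            : orcid2 + "\t" + orcid1;
    }

    /** Collects every unordered pair of authors of one article. */
    static Set<String> pairs(List<String> orcids) {
        Set<String> out = new TreeSet<>();
        for (int i = 0; i < orcids.size(); i++) {
            for (int j = i + 1; j < orcids.size(); j++) {
                out.add(key(orcids.get(i), orcids.get(j)));
            }
        }
        return out;
    }
}
```

Because the key is the same for both orders, a set (or a `unique(orcid1,orcid2)` constraint combined with `insert or ignore`) is enough to drop the duplicates.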
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml

create table author(orcid text unique,name text,affiliation text);
create table collab(orcid1 text,orcid2 text,unique(orcid1,orcid2));
begin transaction;
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4078-7413','Bussotti Giovanni','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-4449-1863');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4078-7413','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-4449-1863','Leonardi Tommaso','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
insert or ignore into collab(orcid1,orcid2) values('0000-0002-4449-1863','0000-0002-6090-3100');
insert or ignore into author(orcid,name,affiliation) values ('0000-0002-6090-3100','Enright Anton J','EMBL, European Bioinformatics Institute, Cambridge, CB10 1SD, United Kingdom;');
(...)
and those SQL statements are loaded into sqlite3:
./src/xslt-sandbox/xalan/dist/xalan -XSL orcid2sqlite.xsl -IN orcid.xml |\
 sqlite3 orcid.sqlite

The next step is to produce a gexf+xml file to play with the orcid graph in gephi.
I use the following bash script to convert the sqlite3 database to gexf+xml.
DB=orcid.sqlite

cat << EOF
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" version="1.2">
<meta>
<creator>Pierre Lindenbaum</creator>
<description>Orcid Graph</description>
</meta>
<graph defaultedgetype="undirected" mode="static">

<attributes class="node">
<attribute type="string" title="affiliation" id="0"/>
</attributes>
<nodes>
EOF

sqlite3 -separator $'\t' -noheader ${DB} 'select orcid,name,affiliation from author' |\
 sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e "s/'/\&apos;/g" -e 's/"/\&quot;/g' |\
 awk -F '\t' '{printf("<node id=\"%s\" label=\"%s\"><attvalue for=\"0\" value=\"%s\"/></node>\n",$1,$2,$3);}'

echo "</nodes><edges>"
sqlite3 -separator $'\t' -noheader ${DB} 'select orcid1,orcid2 from collab' |\
 awk -F '\t' '{printf("<edge source=\"%s\" target=\"%s\"/>\n",$1,$2);}'
echo "</edges></graph></gexf>"



The output is saved and then loaded into gephi.






That's it,

Pierre

04 March 2012

Java Remote Method Invocation (RMI) for Bioinformatics

"Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines*, possibly on different hosts. "[Oracle] In the current post a java client will send a java class to the server that will analyze a DNA sequence fetched from the NCBI, using the RMI technology.

Files and directories

In this example, my files are structured as defined below:
./sandbox/client/FirstBases.java
./sandbox/client/GCPercent.java
./sandbox/client/SequenceAnalyzerClient.java
./sandbox/server/SequenceAnalyzerServiceImpl.java
./sandbox/shared/SequenceAnalyzerService.java
./sandbox/shared/SequenceAnalyzer.java
./client.policy
./server.policy

The Service: SequenceAnalyzerService.java

The remote service provided by the server is defined as an interface named SequenceAnalyzerService: it fetches a DNA sequence for a given NCBI-gi, processes the sequence with an instance of SequenceAnalyzer (see below) and returns a serializable value (that is to say, we can transmit this value through the network).

Extract a value from a DNA sequence : SequenceAnalyzer

The interface SequenceAnalyzer defines how the remote service should parse a sequence. A SAX Parser will be used by the 'SequenceAnalyzerService' to process a TinySeq-XML document from the NCBI. The method characters is called each time a chunk of sequence is found. At the end, the remote server will return the value calculated from getResult:
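The original listings are not reproduced here, so the following is only a rough sketch of what the two shared interfaces could look like. The method names `analyse`, `characters` and `getResult` come from the text above; the parameter lists and return types are assumptions, and the `sandbox.shared` package declaration is left out so that both interfaces fit in one snippet.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Sent by the client to the server, so it has to be Serializable. */
interface SequenceAnalyzer extends Serializable {
    /** Called for each chunk of sequence found by the SAX parser (assumed signature). */
    void characters(char[] chars, int start, int length);

    /** The value sent back to the client once the whole sequence has been read. */
    Serializable getResult();
}

/** The remote service: each method of a Remote interface must declare RemoteException. */
interface SequenceAnalyzerService extends Remote {
    /** Fetches the sequence for an NCBI gi and processes it with the analyzer (assumed signature). */
    Serializable analyse(int gi, SequenceAnalyzer analyzer) throws RemoteException;
}
```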

Server side : an implementation of SequenceAnalyzerService

The class SequenceAnalyzerServiceImpl is an implementation of the service SequenceAnalyzerService. In the method analyse, a SAXParser is created and the given 'gi' sequence is downloaded from the NCBI. The instance of SequenceAnalyzer received from the client is invoked for each chunk of DNA. At the end, the "value" calculated by the instance of SequenceAnalyzer is returned to the client through the network. The 'main' method contains the code to bind this service to the RMI registry:
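The SAX part of that method can be sketched as follows. This is not the code of SequenceAnalyzerServiceImpl: the class `TinySeqScanner` and the callback interface `ChunkConsumer` (a stand-in for SequenceAnalyzer) are hypothetical, and the sketch reads from any `InputSource` rather than downloading from the NCBI. It relies on the fact that TinySeq XML stores the residues in a `TSeq_sequence` element.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: forwards each chunk of a TinySeq sequence to a callback. */
final class TinySeqScanner {
    /** Stand-in for the 'characters' method of SequenceAnalyzer. */
    interface ChunkConsumer {
        void characters(char[] chars, int start, int length);
    }

    static void scan(InputSource in, ChunkConsumer consumer) {
        DefaultHandler handler = new DefaultHandler() {
            private boolean inSequence = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("TSeq_sequence".equals(qName)) inSequence = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("TSeq_sequence".equals(qName)) inSequence = false;
            }

            @Override
            public void characters(char[] chars, int start, int length) {
                // SAX may deliver one text node in several chunks: forward each of them.
                if (inSequence) consumer.characters(chars, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the TinySeq XML", e);
        }
    }
}
```

The sequence is never held in memory as a whole, which is the point of using SAX for long records.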

Client side

On the client side, we're going to connect to the SequenceAnalyzerService and send two distinct implementations of SequenceAnalyzer. What's interesting here: the server doesn't know anything about those implementations of SequenceAnalyzer. The client's java compiled classes have to be sent to the service.

GCPercent.java

A first implementation of 'SequenceAnalyzer' computes the GC% of a sequence:
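As the original listing is missing here, this is a minimal sketch of such an analyzer rather than the code of the post. It is written as a standalone class; in the layout described above it would implement `sandbox.shared.SequenceAnalyzer`. It returns a true percentage, so its numbers are not meant to match the output shown at the end of this post.

```java
import java.io.Serializable;

/** Sketch of a GC% analyzer; Serializable so that it can travel over RMI. */
final class GCPercent implements Serializable {
    private static final long serialVersionUID = 1L;
    private long gc = 0L;
    private long total = 0L;

    /** Counts the bases of one chunk of sequence. */
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char base = Character.toUpperCase(chars[i]);
            if (!Character.isLetter(base)) continue; // skip whitespace and line breaks
            total++;
            if (base == 'G' || base == 'C') gc++;
        }
    }

    /** Returns the percentage of G and C, or 0 for an empty sequence. */
    public Double getResult() {
        return total == 0L ? 0.0 : 100.0 * gc / total;
    }
}
```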

FirstBases

The second implementation of 'SequenceAnalyzer' retrieves the first bases of a sequence.

The Client

And here is the java code for the client. The client connects to the RMI server and invokes 'analyse' with the two instances of SequenceAnalyzer for some NCBI-gi:

A note about security

As neither the server nor the client wants to receive malicious code, we have to use some policy files:
server.policy:

client.policy:

Compiling and Running

Compiling the client

javac -cp . sandbox/client/SequenceAnalyzerClient.java

Compiling the server

javac -cp . sandbox/server/SequenceAnalyzerServiceImpl.java

Starting the RMI registry

${JAVA_HOME}/bin/rmiregistry

Starting the SequenceAnalyzerServiceImpl

$ java \
 -Djava.security.policy=server.policy \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -cp . sandbox.server.SequenceAnalyzerServiceImpl

SequenceAnalyzerService bound.

Running the client

$ java  \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -Djava.security.policy=client.policy  \
 -cp . sandbox.client.SequenceAnalyzerClient  localhost

gi=25 gc%=2.1530612244897958
gi=25 start=TAGTTATTC
gi=26 gc%=2.1443298969072164
gi=26 start=TAGTTATTAA
gi=27 gc%=2.3022222222222224
gi=27 start=AACCAGTATTA
gi=28 gc%=2.376543209876543
gi=28 start=TCGTA
gi=29 gc%=2.2014742014742015
gi=29 start=TCTTTG
That's it, Pierre

28 June 2011

Bioinformatician 2.0 ( JeBiF Workshop 2011 )

Here is the presentation I gave for the JeBiF Workshop 2011 (jobs and careers in bioinformatics).







Pierre Legrain (CEA) and Franck Molina (CNRS) talking about working in bioinformatics in industry and/or academia.

08 February 2011

Visualizing my twitter network with Zoom.it

I wrote a small Java tool to download my twitter network as a GEXF file. This tool is available on github at:


java -jar twittergraph.jar -o twittergraph.gexf 7431072 #my twitter ID


This tool doesn't use the OAuth API, so it has to wait for a few minutes and retry to connect every time it reaches the Twitter API quota (150 requests per hour). In the end it took one night to download the data for my network (~390 friends).
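The wait-and-retry loop can be sketched as below. This is not the code of twittergraph: `QuotaRetry`, `QuotaExceededException` and the parameters are hypothetical, and the caller is assumed to throw `QuotaExceededException` when the API answers that the quota has been reached.

```java
import java.util.function.Supplier;

/** Hypothetical helper: repeats a request until the API quota allows it. */
final class QuotaRetry {
    /** Assumed to be thrown by the request when the hourly quota is exhausted. */
    static final class QuotaExceededException extends RuntimeException {
        private static final long serialVersionUID = 1L;

        QuotaExceededException(String message) {
            super(message);
        }
    }

    /** Calls the request, sleeping waitMillis between attempts, at most maxAttempts times. */
    static <T> T call(Supplier<T> request, long waitMillis, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                return request.get();
            } catch (QuotaExceededException e) {
                if (attempt >= maxAttempts) throw e; // give up
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for the quota", interrupted);
                }
            }
        }
    }
}
```

With a quota of 150 requests per hour, a wait of a few minutes between attempts (for example `5 * 60 * 1000L` milliseconds) is enough for the tool to resume by itself.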

<gexf
xmlns="http://www.gexf.net/1.1draft"
xmlns:viz="http://www.gexf.net/1.1draft/viz"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.1"
xsi:schemaLocation="http://www.gexf.net/1.1draft http://www.gexf.net/1.1draft/gexf.xsd">

<meta lastmodifieddate="2011-02-04">
<creator>Gephi 0.7</creator>
<description/>
</meta>
<graph defaultedgetype="directed" timeformat="double" mode="dynamic">
<attributes class="node" mode="static">
<attribute id="name" title="name" type="string"/>
<attribute id="screenName" title="screenName" type="string"/>
<attribute id="imageUrl" title="imageUrl" type="string">
<default>http://a3.twimg.com/sticky/default_profile_images/default_profile_1_reasonably_small.png</default>
</attribute>
<attribute id="location" title="location" type="string"/>
<attribute id="description" title="description" type="string"/>
<attribute id="protectedProfile" title="protectedProfile" type="boolean"/>
<attribute id="friends" title="friends" type="integer"/>
<attribute id="followers" title="followers" type="integer"/>
<attribute id="listed" title="listed" type="integer"/>
<attribute id="utc_offset" title="utc offset" type="integer"/>
<attribute id="statuses_count" title="statuses count" type="integer"/>
</attributes>
<nodes>
<node id="6612402" label="sciencebase">
<attvalues>
<attvalue for="name" value="David Bradley"/>
<attvalue for="screenName" value="sciencebase"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/1142396198/twitter-blue-bradley_normal.jpg"/>
<attvalue for="location" value="Cambridge, UK"/>
<attvalue for="description" value="Science Writer David Bradley based in Cambridge, UK. Physical and life sciences news and views + technology, internet, web commentary."/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="2022"/>
<attvalue for="followers" value="9197"/>
<attvalue for="listed" value="1065"/>
<attvalue for="utc_offset" value="0"/>
<attvalue for="statuses_count" value="7526"/>
</attvalues>
</node>
<node id="19344270" label="EMBOcomm">
<attvalues>
<attvalue for="name" value="Suzanne Beveridge"/>
<attvalue for="screenName" value="EMBOcomm"/>
<attvalue for="imageUrl" value="http://a0.twimg.com/profile_images/1189685782/S_Beveridge5100_normal.JPG"/>
<attvalue for="location" value="Heidelberg"/>
<attvalue for="description" value="Follow me for the latest from EMBO, the European Molecular Biology Organization"/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="396"/>
<attvalue for="followers" value="697"/>
<attvalue for="listed" value="59"/>
<attvalue for="utc_offset" value="3600"/>
<attvalue for="statuses_count" value="632"/>
</attvalues>
</node>
<node id="20153702" label="walshtp">
<attvalues>
<attvalue for="name" value="Tom Walsh"/>
<attvalue for="screenName" value="walshtp"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/644287976/IMG_0815_normal.JPG"/>
<attvalue for="location" value="Dundee, Scotland"/>
<attvalue for="description" value="Scientific programmer and sysadmin. "/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="129"/>
<attvalue for="followers" value="99"/>
<attvalue for="listed" value="8"/>
<attvalue for="utc_offset" value="0"/>
<attvalue for="statuses_count" value="783"/>
</attvalues>
</node>
<node id="15150655" label="konradfoerstner">
<attvalues>
<attvalue for="name" value="Konrad Förstner"/>
<attvalue for="screenName" value="konradfoerstner"/>
<attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/643611092/konrad_avantar2_normal.jpeg"/>
<attvalue for="location" value="here and there"/>
<attvalue for="description" value="Idealist, Scientist, Includist, Data analyst, Open Source|Data|Access, Coder, Command line friend, CouchSurfer, Konrad"/>
<attvalue for="protectedProfile" value="false"/>
<attvalue for="friends" value="266"/>
<attvalue for="followers" value="167"/>
<attvalue for="listed" value="17"/>
<attvalue for="utc_offset" value="3600"/>
<attvalue for="statuses_count" value="1948"/>
</attvalues>
</node>

(...)

<edge id="E3811" source="14899756" target="14295341">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4816" source="14899756" target="19542750">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4830" source="14899756" target="60065276">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E339" source="14899756" target="617133">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4807" source="14899756" target="15276911">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4822" source="14899756" target="26506721">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4824" source="14899756" target="27023131">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4819" source="14899756" target="22406785">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4808" source="14899756" target="16170580">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E1237" source="14899756" target="4339911">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4828" source="14899756" target="56564230">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4826" source="14899756" target="33838201">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
<edge id="E4815" source="14899756" target="19002481">
<attvalues>
<attvalue for="weight" value="1.0"/>
</attvalues>
</edge>
</edges>
</graph>
</gexf>



The GEXF file was then opened with Gephi, processed with the ForceAtlas algorithm and exported as a PDF file.

The PDF file was uploaded on scribd: http://www.scribd.com/doc/48415306/My-Twitter-Network


I then downloaded the PDF from scribd.com, quickly copied the URL of the generated PDF and pasted it into http://zoom.it/.

Here is the result ! :-)



That's it !

Pierre

03 February 2011

Using the #Gephi toolkit to draw a graph from PSI-MI data, my notebook

Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs. Recently, a Java library, the Gephi Toolkit, has been released and was used by LinkedIn for generating its InMaps.

I've been playing with the Toolkit, to generate a graph from a PSI-MI file downloaded from EMBL-Strings. The source code is available on github:



In short: the program reads the PSI-MI XML, uses XPath to find the nodes and the edges, creates the graph (I could also have used XSLT to create an internal GEXF file, the native file format for Gephi, from the PSI-MI file), applies a layout algorithm (YifanHuLayout) and outputs the result (SVG, PDF, GEXF...). Nevertheless, it was not clear how I could change the output to insert a hyperlink, change the background color, etc. The online javadoc was missing a lot of information.
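The XPath step can be illustrated with the JDK alone. This is a sketch, not the code of PsimiWithGephi: the class `PsimiXPath` is hypothetical, it stops before any call to the Gephi Toolkit, and it assumes the compact PSI-MI form where each participant points to its interactor with an `interactorRef` element. The expressions use `local-name()` so that they don't depend on the namespace declared in the file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: extracts the nodes and edges of a PSI-MI XML document. */
final class PsimiXPath {
    static Document parse(InputSource in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the PSI-MI XML", e);
        }
    }

    private static NodeList select(String expression, Object context) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    /** The nodes of the graph: the id of each interactor. */
    static List<String> nodes(Document doc) {
        List<String> ids = new ArrayList<>();
        NodeList found = select("//*[local-name()='interactor']/@id", doc);
        for (int i = 0; i < found.getLength(); i++) {
            ids.add(found.item(i).getNodeValue());
        }
        return ids;
    }

    /** The edges of the graph: for each interaction, the interactors it refers to. */
    static List<List<String>> edges(Document doc) {
        List<List<String>> edges = new ArrayList<>();
        NodeList interactions = select("//*[local-name()='interaction']", doc);
        for (int i = 0; i < interactions.getLength(); i++) {
            List<String> refs = new ArrayList<>();
            NodeList found = select(".//*[local-name()='interactorRef']", interactions.item(i));
            for (int j = 0; j < found.getLength(); j++) {
                refs.add(found.item(j).getTextContent().trim());
            }
            edges.add(refs);
        }
        return edges;
    }
}
```

Each id returned by `nodes` would become a node of the graph, and each list returned by `edges` one or more edges between the interactors it contains.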

Compilation

javac -cp /path/to/gephi-toolkit.jar -sourcepath src -d src src/sandbox/PsimiWithGephi.java

Generating a PDF for HOPX

I've downloaded the psi-mi.xml for HOPX from the embl-strings database and ran the program.
java -cp /path/to/gephi-toolkit.jar:src sandbox.PsimiWithGephi -o ~/result.pdf file.xml

Result


Viewing the result with Flash

The API can export a GEXF file too and it can be visualized using a flash application named GEXF Explorer:

Note: "Sigma" is another viewer available for gexf: http://ofnodesandedges.com/sigma-neighborhoods-exploration/

That's it,

Pierre

03 May 2010

My new position at INSERM/UMR915

I'm pleased to announce that I've started a new position as a postdoc in Nantes, France, at INSERM/U915, the Thorax Institute, for 3 years. The Institute faces the Loire and works in close collaboration with the neighboring Hotel-Dieu Hospital.


My team is led by Dr Richard Redon, who worked at the Sanger Institute on Copy Number Variations. I'll be part of a project studying the genetics of the Brugada Syndrome using NGS. I'm just starting on this technology, so I'm currently learning how to use all those new tools such as R, MAQ, SAMTOOLS, BWA, etc. biostar.stackexchange.com has been a useful source of information here.

I applied for this position a few days before leaving for Biohackathon 2010 and I started to work in the lab on April 1st. I've been told that my professional network helped me get this job, especially Jan Aerts (my new best friend! ;-) ) (thanks Jan!), who worked with Dr Redon at the Sanger Institute.


More to come... ! :-)

That's it

Pierre