10 June 2007

I will not SPAM Nature Network with NCBI/Pubmed.

I hadn't noticed it before, but it is possible to send batch invitations to join a group on Nature Network (and not just your own private network). All you need is the first name, last name and e-mail address of each recipient. Moreover, the XML output of PubMed contains an <Affiliation> tag which sometimes holds the e-mail address of the author to contact. So I imagined that I could build a list just by extracting this e-mail address and finding the first and last name of the matching person in the author list.

Here is my notebook about the Java program I wrote:
I used the Streaming API for XML (StAX) to parse the XML (I already introduced this technology in a previous post).

In order to link an e-mail address to its owner, I needed a class to store some information about each author:



/**
* Author
* @author pierre
*
*/

private static class Author
{
String ArticleTitle="";
int count=0;
String Suffix="";
String LastName="";
String FirstName="";
String MiddleName="";
String Initials="";
@Override
public String toString() {
return FirstName+" "+LastName;
}
}


Each mail was mapped to an author. When the same e-mail was found twice, I kept the author with the longest FirstName.

/** map mail to author */
private HashMap<String,Author> mail2author= new HashMap<String, Author>();
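
The addAuthor method itself is not shown in this post. As a minimal sketch of the "keep the longest FirstName" rule, assuming a simplified stand-in for the Author class and a hypothetical MailIndex wrapper (neither name comes from the original program), it could look like this:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical wrapper around the mail-to-author map; not the original code. */
final class MailIndex {
    /** Simplified stand-in for the post's Author class (only the fields used here). */
    static final class Author {
        final String firstName;
        final String lastName;

        Author(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    private final Map<String, Author> mail2author = new HashMap<>();

    /** Registers the author under the lower-cased mail, keeping the longer first name on a clash. */
    void addAuthor(String mail, Author candidate) {
        mail2author.merge(
            mail.toLowerCase(Locale.ROOT),
            candidate,
            (current, incoming) ->
                incoming.firstName.length() > current.firstName.length() ? incoming : current);
    }

    /** Returns the author stored for this mail, or null if there is none. */
    Author get(String mail) {
        return mail2author.get(mail.toLowerCase(Locale.ROOT));
    }

    /** Number of distinct mails seen so far. */
    int size() {
        return mail2author.size();
    }
}
```

With this sketch, registering "J Smith" and then "John Smith" under the same address leaves "John Smith" in the map, since a full first name is more useful in an invitation than an initial.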



The following method was invoked whenever an <Author> tag was found; it returned a new Author object (or null for a collective name).

/** parse the &lt;Author&gt; tag */

private Author parseAuthor(XMLEventReader reader) throws IOException,XMLStreamException
{
XMLEvent evt;

Author author= new Author();
while(!(evt=reader.nextEvent()).isEndDocument())

{
if(evt.isEndElement())
{
return author;

}
if(!evt.isStartElement()) continue;
String tag= evt.asStartElement().getName().getLocalPart();

String content= reader.getElementText().trim();

if(tag.equals("LastName"))
{
author.LastName= content;

}
else if(tag.equals("FirstName") || tag.equals("ForeName"))

{
author.FirstName= content;
}
else if(tag.equals("Initials"))

{
author.Initials= content;
}
else if(tag.equals("MiddleName"))

{
author.MiddleName= content;
}
else if(tag.equals("CollectiveName"))

{
return null;
}
else if(tag.equals("Suffix"))

{
author.Suffix= content;
}
else

{
debug("###ignoring "+tag+"="+content);
}

}
throw new IOException("Cannot parse Author");
}


To get PubMed XML from the NCBI, you must first send a "Search" query containing your terms and retrieve some keys, which will then be used to fetch the records.


URL url= new URL(
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="+
URLEncoder.encode(term, "UTF-8")+
"&retstart=0&retmax="+max_return+"&usehistory=y&retmode=xml&email=plindenbaum_at_yahoo.fr&tool=mail");


XMLEventReader reader= newReader(url);
XMLEvent evt;
String QueryKey=null;

String WebEnv=null;
int countId=0;
while(!(evt=reader.nextEvent()).isEndDocument())

{
if(!evt.isStartElement()) continue;
String tag= evt.asStartElement().getName().getLocalPart();

if(tag.equals("QueryKey"))
{
QueryKey= reader.getElementText().trim();

}
else if(tag.equals("WebEnv"))
{

WebEnv= reader.getElementText().trim();
}

else if(tag.equals("Id"))
{
++countId;

}
}
reader.close();
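
The fetch request that consumes these two keys is not reproduced in this post. A minimal sketch of how the URL might be assembled, assuming the standard E-utilities parameters query_key and WebEnv and a hypothetical helper class named EFetchUrl:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper; builds an efetch URL from the keys returned by esearch. */
final class EFetchUrl {
    /** Same host as the esearch call above; only the script name and parameters differ. */
    private static final String BASE =
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed";

    static String build(String queryKey, String webEnv, int retmax) {
        try {
            return BASE
                + "&query_key=" + URLEncoder.encode(queryKey, "UTF-8")
                + "&WebEnv=" + URLEncoder.encode(webEnv, "UTF-8")
                + "&retstart=0&retmax=" + retmax
                + "&retmode=xml";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every JVM must provide, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting string would then be wrapped in a java.net.URL and handed to the same kind of reader as the search query.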


You can now ask to fetch the data. The problem is that StAX does not like the NCBI DTD declaration (I don't know why; maybe the problem comes from the way I instantiated the parser). I worked around the problem by writing a java.io.Reader subclass that skips the second line, which contains the DTD declaration (this is ugly, but it works):

/**
* got some problem with the DTD of the NCBI. This reader ignores the second line
* of the returned XML
* @author pierre
*
*/


private class IgnoreLine2 extends Reader
{
Reader delegate;
boolean found=false;

IgnoreLine2(Reader delegate)
{
this.delegate=delegate;

}

@Override
public int read() throws IOException
{

int c= this.delegate.read();
if(c==-1) return c;

if(c=='\n' && !found)
{
while((c=this.delegate.read())!=-1)

{
if(c=='\n') break;
}

found=true;
return this.read();
}

return c;
}

@Override
public int read(char[] cbuf, int off, int len) throws IOException
{

if(found) return this.delegate.read(cbuf,off,len);

int i=0;
while(i<len)
{

int c= read();
if(c==-1) return (i==0?-1:i);

cbuf[off+i]=(char)c;
++i;

}
return i;
}
@Override
public void close() throws IOException {

delegate.close();
}
}
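
A possible alternative to skipping the line, assuming the trouble came from the parser trying to process the DOCTYPE declaration (which was not verified here), is to ask the XMLInputFactory not to support DTDs at all. The sketch below uses the standard SUPPORT_DTD and IS_SUPPORTING_EXTERNAL_ENTITIES properties; the class name NoDtdStax and the affiliations helper are hypothetical, and the behaviour may differ between StAX implementations:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper; parses XML with DTD processing switched off. */
final class NoDtdStax {
    /** Creates a factory that should neither load nor process any DTD. */
    static XMLInputFactory newFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }

    /** Collects the trimmed text of every Affiliation element found in the document. */
    static List<String> affiliations(String xml) {
        List<String> found = new ArrayList<>();
        try {
            XMLEventReader reader = newFactory().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent evt = reader.nextEvent();
                if (evt.isStartElement()
                        && evt.asStartElement().getName().getLocalPart().equals("Affiliation")) {
                    found.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
        return found;
    }
}
```

If this works with the parser at hand, the IgnoreLine2 reader becomes unnecessary; if not, the line-skipping approach above remains a workable fallback.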


Now, every time we find a PubMed record with an Affiliation tag, we can extract the e-mail address and link it to its author:


if(!authors.isEmpty() &&
Affiliation!=null &&
Affiliation.indexOf('@')!=-1)

{
for(String mail: Affiliation.split("[ \t\\:\\<,\\>\\(\\)]"))

{
mail= mail.replaceAll("[\\{\\}]", "");
if(mail.endsWith(".")) mail= mail.substring(0,mail.length()-1);

int index=mail.indexOf('@');
if(index==-1) continue;

String mailPrefix=mail.substring(0,index).toLowerCase();



boolean found=false;
for(Author a: authors)
{

if(mailPrefix.contains(a.LastName.toLowerCase()) ||
collator.compare(mailPrefix, a.LastName)==0)

{
++count;
addAuthor(mail.toLowerCase(),a);

found=true;
break;
}
}
//search on firstName

if(!found)
{
for(Author a: authors)
{

if(a.FirstName.length()>1 && (mailPrefix.contains( a.FirstName.toLowerCase()) ||
collator.compare(mailPrefix, a.FirstName)==0))

{
++count;
addAuthor(mail.toLowerCase(),a);

found=true;
break;
}
}
}

if(found) break;

debug("\nFailed to find author:\nMail:"+mail+"\nAffiliation: "+Affiliation+"\nAuthors: "+authors.toString()+"\n");
}
}
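
The tokenising and last-name matching steps above can also be tried in isolation. This is a simplified, self-contained sketch with hypothetical names (AffiliationMails, extractMails, matchLastName); it leaves out the Collator comparison and the first-name fallback of the real program:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; a reduced version of the matching logic shown above. */
final class AffiliationMails {
    /** Splits on the same separators as above and keeps the tokens that look like a mail. */
    static List<String> extractMails(String affiliation) {
        List<String> mails = new ArrayList<>();
        for (String token : affiliation.split("[ \t:<>,()]")) {
            String mail = token.replaceAll("[\\{\\}]", "");
            // Affiliations usually end with a full stop glued to the address.
            while (mail.endsWith(".")) {
                mail = mail.substring(0, mail.length() - 1);
            }
            if (mail.indexOf('@') > 0) {
                mails.add(mail.toLowerCase(Locale.ROOT));
            }
        }
        return mails;
    }

    /** Returns the first last name contained in the part before '@', or null if none matches. */
    static String matchLastName(String mail, List<String> lastNames) {
        int at = mail.indexOf('@');
        if (at < 0) {
            return null;
        }
        String prefix = mail.substring(0, at).toLowerCase(Locale.ROOT);
        for (String lastName : lastNames) {
            if (!lastName.isEmpty() && prefix.contains(lastName.toLowerCase(Locale.ROOT))) {
                return lastName;
            }
        }
        return null;
    }
}
```

Matching on the part before the '@' is only a heuristic: an address built from initials or a nickname will find no author, which is the case the debug message above reports.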


Here is the output I get for the query "Bioinformatics" (about 1350 distinct rows):


E M... Zdob... evgueni.zdobnov@xxxxx.ac.uk
R F... Dool... rdoolittle@xxxxx.edu
Jin... Li... jingli@xxxxx.case.edu
Dav... Vene... davenet@xxxxx.ac.be
A... Anto... anestis.antoniadis@xxxxx.fr
J... Hamp... j.hampe@xxxxx.de
Nik... Beer... beerenwinkel@xxxxx.mpg.de
Tee... Kivi... teemu.kivioja@xxxxx.helsinki.fi
H P... Shan... shanahan@xxxxx.ac.uk
Eug... Novi... eugene.novikov@xxxxx.fr
A G... de B... debrevern@xxxxx.jussieu.fr
Rac... Karc... rachelk@xxxxx.ucsc.edu
Tua... Pham... t.pham@xxxxx.edu.au
E... Harl... eharley@xxxxx.toronto.edu
Ste... Wood... swooding@xxxxx.utah.edu
Mus... Asya... asyali@xxxxx.edu.sa
Del... Duec... delbert@xxxxx.toronto.edu
Bur... Morg... burkhard@xxxxx.uni-bielefeld.de
Gia... Manc... mancosu@xxxxx.it
(...)
Kim... Sjö... kimmen@xxxxx.berkeley.edu
P... Nico... nicodeme@xxxxx.cnrs.fr
Orl... Gonz... gonzalez@xxxxx.ifi.lmu.de
Wol... Hube... huber@xxxxx.ac.uk
Tho... Mail... mailund@xxxxx.dk
Fra... DiMa... dimaio@xxxxx.wisc.edu
G... Ball... graham.balls@xxxxx.ac.uk
Jia... Xu... jxu@xxxxx.wustl.edu
Lut... Krau... lutz.krause@xxxxx.uni-bielefeld.de
W... Newe... william.newell@xxxxx.com
A... Hege... heger@xxxxx.ac.uk
Sha... Jens... stjensen@xxxxx.upenn.edu
Hui... Liu... huiqing@xxxxx.a-star.edu.sg


I sent a mail to Nature Network to ask whether I could use this system to send many invitations to join the bioinformatics group, as I didn't want to be considered a spammer. I received this reply from Matt Brown:


Hi Pierre,

Wow, that's very proactive of you. In theory, that would be wonderful.
However, I believe that this may break regulations about sending
unsolicited emails. Laws and regulations in this area vary from country
to country. I would therefore urge you to only invite people to join a
group if you know them personally.

However, we're very keen to see your group grow. Instead of using email,
you might consider posting notices on other forums elsewhere on the web.

Best regards,
Matt

Matt Brown
Editor, Nature Network London


That's fair :-)

I also used the structure of this Java program to create another one, which I will discuss in my next post.

Pierre

1 comment:

William said...

You know, Pierre, I was trying to do exactly this a little while ago, and I ended up going to the CRISP database of NIH-funded grants and writing a screen-scraper for piggybank, because writing a parser was just a little too advanced for me.

Would you send me the code you used?

I also promise that I will not spam Nature Network or anyone else using it.