14 April 2011

"404 not found": a database of non-functional resources in the NAR database collection

Today, Andra Waagmeester asked on Biostar: "NAR nicely lists all their database issues on http://www.oxfordjournals.org/nar/database/c/. Is the list also available in a downloadable format?"

I suggested downloading from PubMed all the articles published in the annual database issues of NAR, extracting the URLs from the abstracts and checking whether they are still active. I just wrote a Java program that does this job (it is available on GitHub at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/NucleicAcidsResearch404.java).
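The URL extraction step could look roughly like the sketch below. It is not the code from the GitHub file: the class name `AbstractUrlExtractor`, the regular expression and the punctuation-trimming rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from NucleicAcidsResearch404.java.
final class AbstractUrlExtractor {
    // Assumed pattern: a scheme followed by anything that is not whitespace, '<', '>' or '"'.
    private static final Pattern URL = Pattern.compile("(?:https?|ftp)://[^\\s<>\"]+");

    /** Returns every URL-like token found in the abstract, in order of appearance. */
    static List<String> extractUrls(String abstractText) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(abstractText);
        while (m.find()) {
            String url = m.group();
            // Drop sentence punctuation that often trails a URL in running text.
            while (!url.isEmpty() && ".,;:)]".indexOf(url.charAt(url.length() - 1)) >= 0) {
                url = url.substring(0, url.length() - 1);
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "MPromDb is available at http://mpromdb.wistar.upenn.edu/. "
                    + "See also (http://biocatalogue.org).";
        // An abstract can contain more than one URL, so a list is returned.
        System.out.println(extractUrls(text));
    }
}
```

A pattern like this will happily return a truncated URL if the abstract itself contains a broken one, so it does nothing about poorly written links.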

A few comments:
  • The connection timeout was set to 10 seconds.
  • Some URLs are poorly written, e.g. http://www.ncbi.nlm.nih.gov/pubmed/14681415
  • An abstract can contain more than one URL.
  • There can be different URLs for the same database.
  • Getting an HTTP 404 error doesn't mean that the database has really been discontinued.
  • Getting an HTTP 200 status doesn't mean that the database is still active and/or maintained.
  • 1155 URLs have been extracted from this PubMed query: `"Nucleic Acids Res"[JOUR] "Database issue"[ISS]` (as far as I can see, this query only goes back to 2004). Edit: OK, that was because NCBI eFetch is limited to 10K records.
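The timeout and status-code caveats from the list above can be sketched as follows. This is a minimal illustration, not the program's actual code: the class `UrlChecker`, the use of the same 10 seconds for the read timeout, and the rule that any 2xx or 3xx status counts as "active" are assumptions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not taken from NucleicAcidsResearch404.java.
final class UrlChecker {
    static final int TIMEOUT_MS = 10_000; // 10 seconds, as stated in the post

    /** Returns the HTTP status code, or -1 if the URL is invalid, unreachable or timed out. */
    static int fetchStatus(String address) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            con.setConnectTimeout(TIMEOUT_MS);
            con.setReadTimeout(TIMEOUT_MS); // assumption: same limit for reading
            con.setRequestMethod("GET");
            try {
                return con.getResponseCode();
            } finally {
                con.disconnect();
            }
        } catch (IOException | RuntimeException e) {
            // Malformed URL, non-HTTP scheme, DNS failure, timeout, ...
            return -1;
        }
    }

    /**
     * Assumed definition of "active": any 2xx or 3xx status.
     * A 200 does not prove the database is maintained, and a 404
     * does not prove it was discontinued; this is only a first filter.
     */
    static boolean looksActive(int status) {
        return status >= 200 && status < 400;
    }

    public static void main(String[] args) {
        for (String address : args) {
            int status = fetchStatus(address);
            System.out.println(address + "\t" + status + "\t" + looksActive(status));
        }
    }
}
```

Run it with one or more URLs as arguments, for example `java UrlChecker http://mpromdb.wistar.upenn.edu/`, to get one tab-separated line per URL.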


Year    Count(URL)    Count(Active)    %
2004    157           100              63
2005    162           114              70
2006    186           147              79
2007    194           158              81
2008    206           180              87
2009    208           186              89
2010    147           136              92
2011    200           193              96
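The percentages in the table are consistent with dividing the active count by the URL count and rounding down. Whether the program computes them this way is an assumption; the class `ActivePercent` below is purely illustrative.

```java
// Hypothetical helper reproducing the last column of the table.
final class ActivePercent {
    /** Percentage of active URLs, rounded down (integer division). */
    static int percentActive(int urlCount, int activeCount) {
        return activeCount * 100 / urlCount;
    }

    public static void main(String[] args) {
        // {year, count(URL), count(active)} copied from the table in the post.
        int[][] rows = {
            {2004, 157, 100}, {2005, 162, 114}, {2006, 186, 147}, {2007, 194, 158},
            {2008, 206, 180}, {2009, 208, 186}, {2010, 147, 136}, {2011, 200, 193}
        };
        for (int[] row : rows) {
            System.out.println(row[0] + "\t" + percentActive(row[1], row[2]) + "%");
        }
    }
}
```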


... a snapshot of the output...


(...)

(...)

Credit for the title: Neil Saunders ;-)


Update:
It seems that the URLs in the abstracts are broken at the places where they were cut in the PDF!
via openwetware.org


That's it,
Pierre

9 comments:

  1. You should publish a paper on this, Pierre.

  2. That will teach me to post my good ideas as comments before going to bed :-) Seriously, nice work and this would make a useful web application. I'm getting to work on that part right now.

  3. Sorry Neil, I didn't want to steal your idea :-)

  4. Is there a FigShare for that table?

  5. This is definitely a good initiative. It will be very useful to have a web server (yes, another one) that lists all published web servers and their state (active / inactive / up / down). For better statistics, it will be interesting to re-run Pierre's script regularly (to eventually produce a percentage of accessibility).
    Pierre and Neil, your idea also reminds me of this article: "Databases, data tombs and dust in the wind", by Wren and Bateman, Bioinformatics, 24: 2127 (2008) http://bioinformatics.oxfordjournals.org/content/24/19/2127

  6. This is a great service! Did you email all the authors of the 'dead' resources? It would be very interesting to know the cause of their demise.

    Wren & Bateman's editorial to the 2008 NAR databases issue also deserves a mention in this context: "Databases, data tombs and dust in the wind"
    http://bioinformatics.oxfordjournals.org/content/24/19/2127.long

  7. Regarding the observation that the query only goes to 2004: this is not because of limits. It's because prior to that date, the issue (ISS) was not named "Database Issue" - it just had a volume number, like regular issues.

  8. Some databases might move to a different location (for example when a lab moves from one university to another), which changes the URL. NAR updates these links. For example, MPromDb is no longer at OSU. It is now available at the following link: http://mpromdb.wistar.upenn.edu/

  9. Doesn't http://biocatalogue.org already do this kind of service tracking?
