Extracting Scientists & SF writers from Wikipedia.
In a recent post on FriendFeed, Christopher Harris asked: "Do you know of any science fiction writer who is/was also a scientist?" My first approach to retrieving those names automatically was to use Freebase. For example, the following MQL query retrieves the people who are both Scientists and SF Writers.
[{
  "id": null,
  "name": null,
  "type": "/people/person",
  "a:profession": [{"name": "Scientist"}],
  "b:profession": [{"name": "Science-Fiction Writer"}],
  "limit": 100
}]
The MQL query editor returned the following result:
{
  "code": "/api/status/ok",
  "result": [
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/edward_llewellyn",
      "name": "Edward Llewellyn",
      "type": "/people/person"
    },
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/konrad_fialkowski",
      "name": "Konrad Fiałkowski",
      "type": "/people/person"
    }
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2009-09-27T15:58:11Z;0002"
}
Only two people! That's not much. The reason is that the articles in Wikipedia, as well as in Freebase, are classified using hierarchical categories (sadly, the hierarchy is not an acyclic graph), but there is no tool to find the articles matching the sub-categories. So you'll have to repeat this query for the "British scientists", the "French biologists", etc. (By the way, I think Wikipedians should not have been allowed to mix two distinct kinds of categories, e.g. profession and nationality: it messes up the classification.) Do you know if this can be achieved using SPARQL and DBpedia?
Then I wrote a Java tool that extracts the pages having a given WP category using the Wikipedia API. This tool, "wpsubcat", is available here: http://code.google.com/p/lindenb/downloads/list and requires BerkeleyDB Java Edition to store the temporary results. The source code is available here: WPSubCat.
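The post does not show which API request WPSubCat sends. As a rough illustration only, the standard MediaWiki `list=categorymembers` query returns the members of one category, filtered by namespace (14 for sub-categories, 0 for articles). The Python sketch below assumes that endpoint; the function names and the `User-Agent` string are made up for the example, and this is not the tool's actual code.

```python
import json
import urllib.parse
import urllib.request

API_PATH = "/w/api.php"  # standard MediaWiki API path on Wikipedia


def category_members_url(base, category, namespace, cont=None):
    """Build a list=categorymembers query URL for one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": str(namespace),
        "cmlimit": "500",
        "format": "json",
    }
    if cont:
        # Continuation token returned by the previous page of results.
        params["cmcontinue"] = cont
    return base + API_PATH + "?" + urllib.parse.urlencode(params)


def category_members(base, category, namespace):
    """Yield the titles of all members of `category` in `namespace`."""
    cont = None
    while True:
        url = category_members_url(base, category, namespace, cont)
        # Hypothetical User-Agent; pick one that identifies your own script.
        req = urllib.request.Request(url, headers={"User-Agent": "wpsubcat-sketch/0.1"})
        with urllib.request.urlopen(req) as response:
            data = json.load(response)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue", {}).get("cmcontinue")
        if cont is None:
            break
```

Something like `list(category_members("http://en.wikipedia.org", "Category:Scientists", 14))` should then list the direct sub-categories; the result changes as Wikipedia is edited.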
Usage
-debug-level <java.util.logging.Level> default:OFF
-base <url> default:http://en.wikipedia.org
-ns <int> restrict results to the given namespace default:14 (Category)
-db-home BerkeleyDB default directory:/tmp/bdb
-d <integer> max recursion depth default:3
-add <category> add a starting article
OR
(stdin|files) containing articles' titles
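The `-d` option bounds how deep the tool follows sub-categories, which matters because the category graph contains cycles. One common way to handle this is a breadth-first walk with a depth limit and a set of already-visited categories. The sketch below is an assumption about the general technique, not WPSubCat's implementation (how the tool counts depth may differ); `walk_categories`, `children_of` and the toy graph are hypothetical.

```python
from collections import deque


def walk_categories(start, children_of, max_depth=3):
    """Return every category reachable from `start` within `max_depth` levels.

    `children_of(category)` must return the direct sub-categories.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in children_of(category):
            if child not in seen:  # guards against cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Toy graph with a cycle (hypothetical data, not real Wikipedia categories).
TOY = {
    "Scientists": ["Physicists", "Biologists"],
    "Physicists": ["Astrophysicists"],
    "Astrophysicists": ["Scientists"],  # cycle back to the root
    "Biologists": [],
}

found = walk_categories("Scientists", lambda c: TOY.get(c, []), max_depth=1)
print(sorted(found))  # ['Biologists', 'Physicists', 'Scientists']
```

In a real run, `children_of` would query the category members in namespace 14 instead of reading a dictionary.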
Examples
Retrieve all the sub-categories of 'Category:Scientists'
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-add "Category:Scientists" > catscientists.txt
Retrieve all the scientists.
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-ns 0 -d 0 catscientists.txt > scientists.txt
Result
After a series of 'sort' and 'comm', the result is the following list (in fact, it is an underestimate: I've since slightly improved the way the sub-categories are retrieved):
- A._Langley_Searles
- Aldous_Huxley
- Aleksey_Nikolayevich_Tolstoy
- Alexander_Bogdanov
- Alpheus_Hyatt_Verrill
- Antoni_Lange
- Archibald_Low
- Arthur_C._Clarke
- Arthur_Conan_Doyle
- Blaine_Pardoe
- Caitlín_R._Kiernan
- Camille_Flammarion
- Carl_Sagan
- Chad_Oliver
- Chandler_Davis
- Charles_Howard_Hinton
- Charles_Platt
- Charles_Sheffield
- Clifford_A._Pickover
- Cordwainer_Smith
- Dan'l_Danehy_Oakes
- David_Jay_Brown
- David_J._Skal
- Donald_Kingsbury
- Dorothy_J._Heydt
- Douglas_Adams
- Duncan_Lunan
- Edgar_Allan_Poe
- Edward_Llewellyn
- Edwin_Fitch_Northrup
- Elf_Sternberg
- Ellen_Beeman
- Eric_Temple_Bell
- Forrest_J_Ackerman
- F._Paul_Wilson
- François_Bordes
- František_Běhounek
- Fred_Hoyle
- Garrett_P._Serviss
- Gary_L._Bennett
- Genrich_Altshuller
- George_Edward_Pendray
- George_Guthridge
- George_Tucker
- Gharlane_of_Eddore
- Gregory_Benford
- György_Kulin
- H._Chandler_Elliott
- Henry_Gee
- Herbert_W._Franke
- Hideaki_Sena
- Ian_Hocking
- Isaac_Asimov
- Ivan_Yefremov
- Jack_C._Haldeman_II
- Jack_Williamson
- Jacques_Vallée
- James_Cooke_Brown
- James_L._Halperin
- Janet_Morris
- Jayant_Narlikar
- J._Michael_Straczynski
- Joan_Slonczewski
- Johannes_Kepler
- John_G._Cramer
- John_R._Pierce
- Joseph_Samachson
- Jo_Walton
- Józef_Sękowski
- Judith_Berman
- Justin_Leiber
- Kathleen_Ann_Goonan
- Kelly_McCullough
- Kim_Newman
- Kir_Bulychev
- Kirill_Eskov
- Konrad_Fiałkowski
- Konstantin_Tsiolkovsky
- Leon_Stover
- Lois_H._Gresh
- Marek_Huberath
- Mary_Doria_Russell
- Meredith_L._Patterson
- Michael_Crichton
- Milan_Šufflay
- Miroslav_Žamboch
- Nick_Webb_(author)
- Ovidiu_Pecican
- Patrick_Moore
- Paul_Levinson
- Peter_Watts
- P._J._Plauger
- R._H._Barlow
- Richard_Garfinkle
- Robert_A._Heinlein
- Robert_Anton_Wilson
- Robert_E._Vardeman
- Robert_L._Forward
- Robert_S._Richardson
- Robert_W._Wood
- Robert_Zubrin
- Ron_Goulart
- Rudy_Rucker
- Ryk_E._Spoor
- Sergey_Lukyanenko
- Shaun_A._Saunders
- Simon_Newcomb
- Slaven_Jelenović
- Stanley_Schmidt
- Suzette_Haden_Elgin
- Thomas_Easton
- Tong_Enzheng
- Vernor_Vinge
- Victor_Anestin
- Vid_Pečjak
- Vladimir_Obruchev
- Voltaire
- Vyacheslav_Rybakov
- William_Kenneth_Hartmann
- William_R._Forstchen
- William_Schoell
- Yoji_Kondo
- Zhang_Xiguo
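The 'sort' and 'comm' step amounts to keeping the titles that appear both in the scientists list and in a list of SF writers, presumably built the same way. A minimal Python sketch of that intersection follows; the function name and the toy titles are illustrative, and the post does not give the exact commands that were used.

```python
def common_titles(lines_a, lines_b):
    """Return the sorted titles present in both inputs.

    Roughly what `sort -u` on each file followed by `comm -12` produces.
    """
    titles_a = {line.strip() for line in lines_a if line.strip()}
    titles_b = {line.strip() for line in lines_b if line.strip()}
    return sorted(titles_a & titles_b)


# Toy input (hypothetical excerpts; the real files come from WPSubCat).
scientists = ["Carl_Sagan\n", "Fred_Hoyle\n", "Marie_Curie\n"]
sf_writers = ["Fred_Hoyle\n", "Ursula_K._Le_Guin\n", "Carl_Sagan\n"]
print(common_titles(scientists, sf_writers))  # ['Carl_Sagan', 'Fred_Hoyle']
```

With real files, the two arguments can simply be open file objects, since iterating over a file yields its lines.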
That's it
Pierre
3 comments:
The DBPedia SPARQL end point can be found at:
http://dbpedia.org/sparql
Cool, but you missed Alastair Reynolds who was working at the ESA until last year, and Joe Haldeman (only a BSc in science admittedly).
@baoilleach: those persons are missing because they haven't been 'categorized'. E.g. Haldeman was classified with the following categories (2009-09-28):
1943 births | Living people | People from Oklahoma City, Oklahoma | People from Gainesville, Florida | American science fiction writers | Military science fiction writers | American novelists | Writers from Oklahoma | Hugo Award winning authors | Nebula Award winning authors | University of Maryland, College Park alumni | American military personnel of the Vietnam War | Worldcon Guests of Honor | Iowa Writers' Workshop alumni.
None of those categories is a sub-class of 'Category:Scientists' (depth=3).
The same goes for Reynolds: he was categorized under "Alumni_of_Newcastle_University", but this category is not a sub-class of 'Category:Scientists' (depth=3).