26 September 2006

MYSQL UDF, trees of data, hierarchy

In a previous post I described how to write a mysql user defined function (UDF) in C to create a translate-dna-to-protein function for mysql. Now, I've been playing with hierachies and I wrote a few UDFs to explore a tree of data; I guess this could have been done with a mysql stored procedure but I'm currently using an old server (and I need a deeper knowledge of mysql5 :-) ). The source code is available [here].
The tree itself is written as a sorted static const array of data so you cannot modify it via a mysql command but you'll have to recompile the code.

typedef struct Taxonomy
{
/** the ncbi-id */
taxon_type_t tax_id;
/** id of the parent */
taxon_type_t parent_id;
/** scientific name */
char name[TAXON_NAME_SIZE];
}Taxon,*const TaxonPtr;


It may also not be suitable for large trees. In my example, I've been using a subset of the NCBI taxonomy:

static const Taxon all_taxons[]={
{1, -1, "root"},
{2759, 131567, "Eukaryota"},
{6072, 33208, "Eumetazoa"},
{7711, 33511, "Chordata"},
{7742, 89593, "Vertebrata"},
{7776, 7742, "Gnathostomata"},
{8287, 117571, "Sarcopterygii"},
{9347, 32525, "Eutheria"},
{9443, 314146, "Primates"},
...)
};


but one could imagine something like this...

{1,-1,"root"},
{2,1,"world"},
{3,2,"europe"},
{4,3,"france"},
...


or...

{1,-1,"rdf:Property"},
{2,1,"foaf:knows"},
{3,2,"rel:friendOf"},
{4,2,"rel:childOf"},
...


the source taxonudf.c was successfuly compiled on my computer using the following command-line.

gcc -fPIC -shared -DDBUG_OFF -O3 -I/usr/include/mysql -lmysqlclient -o /usr/lib/taxonudf.o taxonudf.c


taxon_name returns the scientific name of an organism from a ncbi-tax-id.

mysql> create function taxon_name
returns string soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_name(9606);
+------------------+
| taxon_name(9606) |
+------------------+
| Homo sapiens |
+------------------+
1 row in set (0,43 sec)


tax_id returns the ncbi-tax-id from from its name.

mysql> create function taxon_id
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_id("Homo sapiens");
+--------------------------+
| taxon_id("Homo sapiens") |
+--------------------------+
| 9606 |
+--------------------------+
1 row in set (0,00 sec)


taxon_childof returns wether a node in the hierarchy is a descendant of another node.

mysql> create function taxon_childof
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> select taxon_childof(taxon_id("Homo"),taxon_id("Homo Sapiens"))
as "is Homo child of Homo.Sapiens ?";
+---------------------------------+
| is Homo child of Homo.Sapiens ? |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (0,00 sec)


mysql> select taxon_childof(taxon_id("Homo Sapiens"),taxon_id("Homo"))
as "Is Homo.Sapiens descendant of Homo ?";
+--------------------------------------+
| Is Homo.Sapiens descendant of Homo ? |
+--------------------------------------+
| 1 |
+--------------------------------------+
1 row in set (0,00 sec)



taxon_com is an aggregate function which finds the common ancestral node in a set of node.

mysql> create aggregate function taxon_com
returns integer soname "taxonudf.o";
Query OK, 0 rows affected (0,00 sec)

mysql> create temporary table t1(cluster varchar(20),taxon int);
Query OK, 0 rows affected (0,07 sec)

mysql> insert into t1(cluster,taxon) values("A",251093),
("A",9781), ("A",37348),("B",9605),("B",9606),("B",63221),
("C",32523),("C",33154),("C",7776),("C",9443);
Query OK, 10 rows affected (0,03 sec)
Records: 10 Duplicates: 0 Warnings: 0

mysql> select cluster,taxon as "ncbi-id",taxon_name(taxon) as "Name",taxon_childof(taxon,taxon_id("Primates")) as "Is_Primate" from t1;
+---------+---------+-------------------------------+------------+
| cluster | ncbi-id | Name | Is_Primate |
+---------+---------+-------------------------------+------------+
| A | 251093 | Elephas antiquus | 0 |
| A | 9781 | Elephantidae gen. sp. | 0 |
| A | 37348 | Mammuthus | 0 |
| B | 9605 | Homo | 1 |
| B | 9606 | Homo sapiens | 1 |
| B | 63221 | Homo sapiens neanderthalensis | 1 |
| C | 32523 | Tetrapoda | 0 |
| C | 33154 | Fungi/Metazoa group | 0 |
| C | 7776 | Gnathostomata | 0 |
| C | 9443 | Primates | 1 |
+---------+---------+-------------------------------+------------+
10 rows in set (0,00 sec)

mysql> select cluster,taxon_com(taxon) as ncbi_id
from t1 group by cluster;
+---------+---------+
| cluster | ncbi_id |
+---------+---------+
| A | 9780 |
| B | 9605 |
| C | 33154 |
+---------+---------+
3 rows in set (0.00 sec)


That's all folks

03 September 2006

Scott McCloud at SciFoo 2006

I've just discovered on Flickr that Scott McCloud was present at SciFoo 2006.

scifoo


McLoud is the author of Understanding comics, one of the best book about comics I've ever read.

scifoo


In this book, McLoud introduced a map of visual iconography that took the shape of a triangle.

The lower left corner was visual resemblance (e.g., photography and realistic painting). The lower right included the products of what he called iconic abstraction (e.g., cartooning). And at the top were the denizens of the picture plane ("pure" abstraction) which ceased to make reference to any visual phenomena other than themselves. The move from realism to cartoons along the bottom edge was a move away from resemblance that still retained "meaning," so words, the next logical step in the progression, were included at far right, thereby enclosing anything in comics' visual vocabulary between the three points.

See also: http://www.scottmccloud.com/inventions/triangle/triangle.html