11 May 2010

Playing with the C API for sqlite: inserting the hapmap data , my notebook.



In this post, I'll insert a Hapmap genotype file into a SQLite database using the C API for SQLIte. Of course, this is just a test and I would not use SQLite to store this kind of information. Moreover the speed of insertion of SQLite was deadly slow here.

The Genotype file

The first line contains the name of the individuals, the other rows of the file contain one SNP per line and the genotype for each individual:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode NA06984 NA06985 NA06986 NA06989 NA06991 NA06993 NA06994 NA06995 NA06997 NA07000 NA07014 NA07019 NA07022 NA07029 NA07031 NA07034 NA07037 NA07045 NA07048 NA07051 NA07055 NA07056 NA07345 NA07346 NA07347 NA07348 NA07349 NA07357 NA07435 NA10830 NA10831 NA10835 NA10836 NA10837 NA10838 NA10839 NA10840 NA10843 NA10845 NA10846 NA10847 NA10850 NA10851 NA10852 NA10853 NA10854 NA10855 NA10856 NA10857 NA10859 NA10860 NA10861 NA10863 NA10864 NA10865 NA11829 NA11830 NA11831 NA11832 NA11839 NA11840 NA11843 NA11881 NA11882 NA11891 NA11892 NA11893 NA11894 NA11917 NA11918 NA11919 NA11920 NA11930 NA11931 NA11992 NA11993 NA11994 NA11995 NA12003 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12056 NA12057 NA12144 NA12145 NA12146 NA12154 NA12155 NA12156 NA12234 NA12236 NA12239 NA12248 NA12249 NA12264 NA12272 NA12273 NA12275 NA12282 NA12283 NA12286 NA12287 NA12335 NA12336 NA12340 NA12341 NA12342 NA12343 NA12344 NA12347 NA12348 NA12375 NA12376 NA12383 NA12386 NA12399 NA12400 NA12413 NA12489 NA12546 NA12707 NA12708 NA12716 NA12717 NA12718 NA12739 NA12740 NA12748 NA12749 NA12750 NA12751 NA12752 NA12753 NA12760 NA12761 NA12762 NA12763 NA12766 NA12767 NA12775 NA12776 NA12777 NA12778 NA12801 NA12802 NA12812 NA12813 NA12814 NA12815 NA12817 NA12818 NA12827 NA12828 NA12829 NA12830 NA12832 NA12842 NA12843 NA12864 NA12865 NA12872 NA12873 NA12874 NA12875 NA12877 NA12878 NA12889 NA12890 NA12891 NA12892
rs6423165 A/G chrY 109805 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8572888:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG AG GG AG AG AG AA AA GG AA AG NN AG AA AG NN AA AG NN GG GG GG GG AA GG GG AG AG AA AG GG AG AG AG AA AG AG AG GG GG AA AA NN GG AG NN AA AG NN GG NN GG GG GG AA AG AA AA AG GG AG AG GG AG AA AG GG AG GG GG AG GG AG GG GG AG GG GG AA NN AG GG AG GG GG GG AG GG GG AA AA GG GG GG NN AG AG AG GG AG AG AA AA GG AA GG AA AG AA AG GG AG GG GG AG AG GG GG AG AG AG NN AG AG AG AG AG NN AA AA AG AG AG AG AG AA AA AG AG AA AG GG AG GG GG AG AG AG AG AG AG AA AG AG AG AG AA AA AG GG GG GG GG AG AG GG GG AG AA GG AG AG GG AG
rs6608381 C/T chrY 109922 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8528859:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT CC CT CT CT TT TT CC TT CT NN CT TT CT NN TT CC NN CC CC CC CC TT CC CC CT CT TT CT CC CT CT CT TT CT CT CT CC CC TT TT NN CC CT NN TT CT NN CC NN CC CC CC TT CT TT TT CT CC CT CT CC CT TT CT CC CT CC CC CT CC CT CC CC CT CC CC TT NN CT CC CT CC CC CC CT CC CC TT TT CC CC CC NN CT CT CT CC CT CT TT TT CC TT CC TT CT TT CT CC CT CC CC CT CT CC CC CT CT CT NN CT CT CT CT CT NN TT TT CT CT CT CT CT TT TT CT CT TT CT CC CT CC CC CT CT CC CT CT CC TT CT CT CT CT TT TT CT CC CC CC CC CT CT CC CC CT TT CC CT CT CC CT
rs6644972 A/G chrY 118624 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6644972:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG AG GG AG AG AG GG AG GG AG NN GG GG AG NN GG GG NN AG AG GG GG GG GG GG GG GG GG AG AA GG GG GG AG GG GG GG GG GG GG GG NN GG AG GG GG GG NN AG NN AG AG AA GG GG GG GG GG GG GG GG GG AG GG GG AA AG GG AG GG GG GG GG GG GG GG AG AG NN AG GG AG GG GG GG AG GG GG GG GG AA AA AG NN GG GG GG GG GG GG GG AG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG NN GG GG AG GG GG AG GG AG GG AG AA GG GG GG GG AG GG GG GG AG GG AG GG GG GG AG GG GG AG GG AG GG GG GG GG GG GG AG GG GG GG GG GG GG GG GG
rs6655397 A/G chrY 141935 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6655397:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG GG GG AG AG GG GG GG GG GG NN GG GG NN NN GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN AG GG GG GG GG NN AG NN GG GG GG NN AG GG GG GG AG GG GG AG GG GG GG GG AG GG AG GG GG GG GG AG AG GG GG GG NN AG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG AG GG AG NN GG GG GG GG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG GG
rs6603172 C/T chrY 142439 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8527413:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN CT TT TT TT TT NN CT NN TT TT TT TT CT TT TT TT CT TT TT CT TT TT TT TT CT TT CT TT TT TT TT CT CT TT TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT CT NN TT TT TT TT TT TT TT TT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT
rs6644970 A/G chrY 142664 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-4207883:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ AG AA AA AG AG GG AG AG AA AA GG NN AG AA GG NN AG AG NN AG AG AG GG AG AG AG AA AG GG GG AA AG AG AG AG AG AA AA AG GG AG GG NN GG AG NN AG AG NN AG NN AA AG AA AA AG GG GG AG AG AG AG AG AG AA AG AA AG GG GG AG AG AA GG AG GG AA AG AG NN AG AG GG AG GG GG AG GG GG AA AG AA AG GG NN GG AA GG AG AG GG GG AA GG AA AG AG GG GG AG AG GG AG AG AG GG AG AG AG AG AG NN AG GG AG AG GG NN AA AG AA AA GG AG AG AG GG AG AG AG GG AG AA AG GG AA AA AG GG AA AG AG GG AA GG AA AA AG AG AG AG AG GG AG GG GG AG AG AG AA AA GG AG AG
rs11019 A/G chrY 159978 + ncbi_B36 affymetrix urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2 urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-4207841:2 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ NN AG NN NN AG GG AA NN NN AG NN AG GG AA NN AG NN NN GG NN GG AG GG NN NN GG NN GG NN GG AG GG NN NN GG AG NN NN NN AG AG NN NN NN NN AA AG GG GG AA AG GG GG NN NN GG GG AG GG AG AG NN AA AA NN NN NN NN NN NN NN NN NN NN GG AG GG AG AG GG AG AG GG GG NN GG GG AG AG GG GG AG AG AG AG AG GG GG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN AG NN AA GG NN NN GG NN NN GG GG AG AA AG GG AA AA NN NN NN NN NN NN AG GG GG AG GG AG NN NN NN NN NN NN NN NN NN GG GG AG AG GG GG NN GG NN NN GG AG
rs1132353 C/T chrY 160116 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs1132353:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT TT CT CT TT NN CT TT CT CC NN TT CC CT NN TT CT NN CT TT CT CT TT TT CC TT CT CC TT CT TT TT TT TT CT TT TT TT CT CT TT NN TT CT NN CT TT NN CC NN TT TT CC NN TT TT CT TT CT CT CC CC CC CT CT CC CT CC CT CT TT TT TT TT CT TT CT CT NN CT CT TT TT TT TT TT CC CT TT TT CT CT CT NN CT TT TT TT TT TT TT TT TT TT TT CT CT CC TT CT TT CT CT CT CT CT CC CC CC CC TT CC CT CT CT CC NN TT CT TT TT CT TT CT CT CC CT TT CC CC TT TT TT CT TT TT CT TT TT CT TT CT TT TT CT TT CT TT CT CT CT TT TT CT CT TT TT CT TT TT CC TT CT
rs35047434 C/T chrY 169118 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs35047434:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT CT TT TT CT TT NN CT TT TT TT TT NN TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT CT TT TT TT CT TT CT CT TT TT TT TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT NN TT CT TT TT TT CT CT TT TT TT CT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT CT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT CT TT TT TT CT TT TT TT TT CT TT TT TT CC TT TT TT TT TT TT
(...)

Opening the SQLIte Database

if(sqlite3_open(env->fileout,&(env->connection))!=SQLITE_OK)
{
fprintf(stderr,"Cannot open sqlite file %s.\n",env->fileout);
exit(EXIT_FAILURE);
}

Closing the SQLIte Database

sqlite3_close(env->connection);

Creating a table

if(sqlite3_exec(env->connection,
"create table markers(id INTEGER PRIMARY KEY,name varchar(20) unique,chrom varchar(10),position integer)",
NULL,NULL,&error
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot create table markers: %s\n",error);
exit(EXIT_FAILURE);
}

Filling a prepared statement and inserting some data

if(sqlite3_prepare(env->connection,
"insert into markers(name,chrom,position) values(?,?,?)",
-1,&pstmt_insert_marker,NULL
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot compile insert into markers.\n");
exit(EXIT_FAILURE);
}
(...)

if(sqlite3_bind_text(
pstmt_insert_marker,1,
name,
strlen(name),
NULL)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's name.\n");
exit(EXIT_FAILURE);
}
(
if(sqlite3_bind_text(
pstmt_insert_marker,2,
chrom,
strlen(chrom),
NULL)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's chrom.\n");
exit(EXIT_FAILURE);
}
if( sqlite3_bind_int(
pstmt_insert_marker,3,
genomic_position
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's position.\n");
exit(EXIT_FAILURE);
}

if (sqlite3_step(pstmt_insert_marker) != SQLITE_DONE) {
fprintf(stderr,"Could not step insert marker.\n");
exit(EXIT_FAILURE);
}

marker_id = sqlite3_last_insert_rowid(env->connection);

sqlite3_reset(pstmt_insert_marker);

Creating a custom function for sqlite

Here I implemented a custom function named homozygous it returns true if a genotype is a string of two identical characters (but not 'N').
void is_homozygous(
sqlite3_context* context, /* context */
int argc, /* number of arguments */
sqlite3_value** argv /* arguments */
)
{
/* one text arg */
if(argc==1 &&
sqlite3_value_type(argv[0])==SQLITE_TEXT)
{
const char *alleles=(const char*)sqlite3_value_text(argv[0]);
if( alleles!=NULL &&
strlen(alleles)==2 &&
alleles[0]==alleles[1] &&
alleles[0]!='N')
{
/* set result to TRUE */
sqlite3_result_int(context,1);
return;
}
}
/* set result to FALSE */
sqlite3_result_int(context,0);
}

Telling sqlite about the new custom function

sqlite3_create_function(env->connection,
"homozygous",/* function name */
1,/* number of args */
SQLITE_UTF8,/* encoding */
NULL,
is_homozygous,/* the implementation callback */
NULL,
NULL
);

Implementing a callback for a simple 'SELECT' query

The following callback prints the results to stdout.
static int callback_select_homozygous(
void* notUsed,
int argc,/* number of args */
char** argv,/* arguments as string */
char** columns /* labels for the columns */
)
{
int i;
/* prints all the columns to stdout*/
for(i=0; i<argc; i++)
{
printf( "\"%s\": \"%s\"\n",
columns[i],
argv[i]!=NULL? argv[i] : "NULL"
);
}
printf("\n");
return 0;
}

Executing a SELECT query

Here, we select all the homoygous genotypes:
if(sqlite3_exec(
env->connection, /* An open database */
"select individuals.name,"
"markers.name,"
"markers.chrom,"
"markers.position,"
"genotypes.alleles "
"from "
"genotypes,"
"markers,"
"individuals "
"where "
"markers.id=genotypes.marker_id and "
"individuals.id=genotypes.individual_id and "
"homozygous(genotypes.alleles)=1"
, /* SQL to be evaluated */
callback_select_homozygous, /* Callback function */
NULL, /* 1st argument to callback */
&errormsg /* Error msg written here */
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot select.\n");
exit(EXIT_FAILURE);
}

The full source code



Excuting the C program

[LOG]inserted marker ID.1
[LOG]inserted marker ID.2
[LOG]inserted marker ID.3
[LOG]inserted marker ID.4
[LOG]inserted marker ID.5
(... WAIT !!! )
"name": "NA12878"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12890"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12892"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"


That's it

Pierre

No comments: