Validating JSON with lex & yacc
In a recent post on Twitter, Chris Lasher/agbiotec said:
I did a quick Google for "JSON schema" and "JSON validation"; looks like there's nothing in place yet like XML schema..
I suggested that lex/yacc could be used to create a trivial tool for this kind of validation. Here is an example.
Say you have a linkage file expressed as JSON. This file contains some information about a set of genetic markers, a set of samples and some genotypes.
{
"markers":[
{
"id":1,
"name":"rs1",
"chrom":"chr1",
"position":1
},
{
"id":2,
"name":"rs2",
"chrom":"chr1",
"position":2
},
{
"id":3,
"name":"rs3",
"chrom":"chr1",
"position":3
}
],
"samples":[
{
"id":1,
"name":"Individual1",
"father-id":2,
"mother-id":3,
"illness":true
},
{
"id":2,
"name":"Individual2",
"father-id":0,
"mother-id":0,
"illness":false
},
{
"id":3,
"name":"Individual3",
"father-id":0,
"mother-id":0,
"illness":true
}
],
"genotypes" : [
{
"sample":1,
"marker":2,
"allele-1":"A",
"allele-2":"T"
},
{
"sample":2,
"marker":2,
"allele-1":"A",
"allele-2":"T"
}
]
}
To validate this file with lex/yacc (or flex/bison), we need:
- A parser generated by bison. This parser contains the grammatical rules.
- A lexer (also called a scanner) generated by flex. This lexer transforms the input into a stream of tokens.
sample.y: the Parser
%{
#include <stdio.h>
int yylex(void); /* provided by the flex-generated lexer */
int yywrap() { return 1;} /* no more input after end of file */
void yyerror(const char* s) {fprintf(stderr,"Error:%s.\n",s);}
%}
%token ALLELE_1 ALLELE_2 CHROM FATHER_ID GENOTYPES ID ILLNESS MARKER MARKERS MOTHER_ID NAME POSITION SAMPLE SAMPLES
%token BOOLEAN INTEGER STRING
%start linkage
%%
linkage: '{' markers ',' samples ',' genotypes '}' ;
markers: MARKERS ':' '[' marker_list ']';
marker_list: marker | marker_list ',' marker ;
marker: '{'
ID ':' INTEGER ','
NAME ':' STRING ','
CHROM ':' STRING ','
POSITION ':' INTEGER
'}';
samples: SAMPLES ':' '[' sample_list ']';
sample_list: sample | sample_list ',' sample ;
sample: '{'
ID ':' INTEGER ','
NAME ':' STRING ','
FATHER_ID ':' INTEGER ','
MOTHER_ID ':' INTEGER ','
ILLNESS ':' BOOLEAN
'}';
genotypes: GENOTYPES ':' '[' genotype_list ']';
genotype_list: genotype | genotype_list ',' genotype;
genotype: '{'
SAMPLE ':' INTEGER ','
MARKER ':' INTEGER ','
ALLELE_1 ':' STRING ','
ALLELE_2 ':' STRING
'}';
%%
int main(int argc,char** argv)
{
/* yyparse() returns 0 when the input matches the grammar */
return yyparse();
}
In this file, %token declares the tokens that will be accepted by the parser. The grammar %starts with the linkage rule. This rule starts with an opening brace followed by a 'markers' rule, followed by a comma, followed by a 'samples' rule, followed by a comma, followed by a 'genotypes' rule, followed by a closing brace.
The 'markers' rule says that the "markers" key is followed by a colon and a 'marker_list' enclosed in a pair of square brackets.
A marker_list is either a single marker, or a marker_list (recursive rule) followed by a comma and another marker.
A marker is a set of JSON key/value pairs. (Here, for simplicity, I expect all the fields to appear in a fixed order.)
etc...
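The recursive list rule can be pictured as a simple loop. The following is only a minimal sketch in plain C, not what bison actually generates: the function name is_marker_list and the one-character token encoding ('m' standing for a whole marker object, ',' for a comma) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 when `tokens` spells "m" or "m,m,...,m",
   i.e. the language described by
       marker_list: marker | marker_list ',' marker ;
   and 0 otherwise. */
int is_marker_list(const char *tokens)
{
    size_t i = 0;

    /* first alternative: a marker_list is a single marker */
    if (tokens[i] != 'm') return 0;
    i++;

    /* second alternative: an existing list, then ',' and another marker */
    while (tokens[i] == ',') {
        if (tokens[i + 1] != 'm') return 0; /* trailing comma is rejected */
        i += 2;
    }

    /* nothing may follow the list */
    return tokens[i] == '\0';
}
```

With this encoding, "m,m,m" is accepted, while an empty input, "m," and ",m" are all rejected, which is also how the grammar treats an empty array or a dangling comma.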
To convert this file into a C source (sample.tab.c) and a C header (sample.tab.h):
bison -d sample.y
sample.l: the Lexer
Here the lexer is basically an ordered set of regular expressions, each returning a token identifier (e.g. BOOLEAN, INTEGER, MARKERS, ...) for the text it matches in the input. Those identifiers were declared in the C header (sample.tab.h) that bison generated from the grammar.
%{
#include <stdio.h>
#include "sample.tab.h" /* generated by bison -d */
%}
%%
"\"allele-1\"" return ALLELE_1;
"\"allele-2\"" return ALLELE_2;
"\"chrom\"" return CHROM;
"\"father-id\"" return FATHER_ID;
"\"mother-id\"" return MOTHER_ID;
"\"id\"" return ID;
"\"markers\"" return MARKERS;
"\"marker\"" return MARKER;
"\"illness\"" return ILLNESS;
"\"genotypes\"" return GENOTYPES;
"\"name\"" return NAME;
"\"position\"" return POSITION;
"\"samples\"" return SAMPLES;
"\"sample\"" return SAMPLE;
true return BOOLEAN;
false return BOOLEAN;
[0-9]+ return INTEGER;
\"[^\"]*\" return STRING;/* a very simple string without escapes... */
[ \n\t\r] ;/* ignore */
. return yytext[0];
%%
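To make the role of these rules more concrete, here is a hand-written sketch in C of what the generated lexer does for a small subset of them. It is an illustration only, not the code flex produces: the enum, the function name next_token and the T_CHAR catch-all are assumptions, and only the "id" and "name" keys are handled.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds, mirroring a few of the %token names. */
enum token { T_EOF = 0, T_ID, T_NAME, T_BOOLEAN, T_INTEGER, T_STRING, T_CHAR };

/* Scan one token starting at *p, advance *p past it and return its kind.
   As in the flex file, the quoted keywords are tested before the generic
   string rule, so "id" is reported as T_ID rather than T_STRING. */
enum token next_token(const char **p)
{
    const char *s = *p;

    /* ignore blanks, like the [ \n\t\r] rule */
    while (*s == ' ' || *s == '\n' || *s == '\t' || *s == '\r') s++;
    if (*s == '\0') { *p = s; return T_EOF; }

    if (strncmp(s, "\"id\"", 4) == 0)   { *p = s + 4; return T_ID; }
    if (strncmp(s, "\"name\"", 6) == 0) { *p = s + 6; return T_NAME; }
    if (strncmp(s, "true", 4) == 0)     { *p = s + 4; return T_BOOLEAN; }
    if (strncmp(s, "false", 5) == 0)    { *p = s + 5; return T_BOOLEAN; }

    if (isdigit((unsigned char)*s)) {   /* [0-9]+ */
        while (isdigit((unsigned char)*s)) s++;
        *p = s;
        return T_INTEGER;
    }
    if (*s == '"') {                    /* a simple string without escapes */
        const char *end = strchr(s + 1, '"');
        if (end != NULL) { *p = end + 1; return T_STRING; }
    }

    /* any other character: the flex rule returns the character itself,
       this sketch just reports a generic T_CHAR */
    *p = s + 1;
    return T_CHAR;
}
```

Calling next_token repeatedly on the text `"id": 12` would yield T_ID, T_CHAR (the colon), T_INTEGER and then T_EOF. The parser never sees the raw characters, only this stream of token kinds.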
To convert this file into a C source (lex.yy.c):
flex sample.l
Compilation
gcc -o validate sample.tab.c lex.yy.c
Testing
cat sample.json | ./validate
echo "Hello"| ./validate
Error: Syntax error
That's it
Pierre
2 comments:
It seems that a JSON schema is in the works now.
While XML has been utilized in multiple languages, JSON is made specifically for JavaScript. In fact, I found a validation tool written in JavaScript (no schemas yet, but still cool).
Do you know of any efforts to create a JSON standard for genomic data? I was hoping to use it for a research project I am doing this summer.
@Evan,
No, I think that most efforts will focus on RDF rather than on JSON.