YOKOFAKUN: Validating JSON with lex & yacc

17 December 2008

Validating JSON with lex & yacc

In a recent post on Twitter, Chris Lasher/agbiotec said:
I did a quick Google for "JSON schema" and "JSON validation"; looks like there's nothing in place yet like XML schema..
I suggested that lex/yacc could be used to create a trivial tool for this kind of validation. Here is an example.
Say you have a linkage file expressed as JSON. This file contains some information about a set of genetic markers, a set of samples and some genotypes.

{
"markers":[
 {
  "id":1,
  "name":"rs1",
  "chrom":"chr1",
  "position":1
 },
 {
  "id":2,
  "name":"rs2",
  "chrom":"chr1",
  "position":2
 },
 {
  "id":3,
  "name":"rs3",
  "chrom":"chr1",
  "position":3
 }
 ],
"samples":[
 {
  "id":1,
  "name":"Individual1",
  "father-id":2,
  "mother-id":3,
  "illness":true
 },
 {
  "id":2,
  "name":"Individual2",
  "father-id":0,
  "mother-id":0,
  "illness":false
 },
 {
  "id":3,
  "name":"Individual3",
  "father-id":0,
  "mother-id":0,
  "illness":true
 }
],
"genotypes" : [

 {
  "sample":1,
  "marker":2,
  "allele-1":"A",
  "allele-2":"T"
 },
 {
  "sample":2,
  "marker":2,
  "allele-1":"A",
  "allele-2":"T"
 }

]
}

To validate this file with lex/yacc (or flex/bison), we need:

A parser generated by bison (called the Scanner in this post). It contains the grammar rules.

A Lexer generated by flex. It splits the input into a stream of tokens for the parser.

sample.y: the Scanner

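%{
#include <stdio.h>
int yylex(void); /* provided by the flex-generated lexer */
int yywrap(void) { return 1; } /* no further input after end of file */
void yyerror(const char* s) { fprintf(stderr,"Error:%s.\n",s); }
%}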


%token ALLELE_1 ALLELE_2 CHROM FATHER_ID GENOTYPES ID ILLNESS MARKER MARKERS MOTHER_ID NAME POSITION SAMPLE SAMPLES
%token BOOLEAN INTEGER STRING
%start linkage
%%

linkage: '{' markers ','  samples ',' genotypes '}' ;

markers: MARKERS ':' '[' marker_list ']';
marker_list: marker | marker_list ',' marker ;
marker: '{'
         ID ':' INTEGER ','
         NAME ':' STRING ','
         CHROM ':' STRING ','
         POSITION ':' INTEGER
        '}';



samples: SAMPLES ':' '[' sample_list ']';
sample_list: sample | sample_list ',' sample ;
sample: '{'
         ID ':' INTEGER ','
         NAME ':' STRING ','
         FATHER_ID ':' INTEGER ','
         MOTHER_ID ':' INTEGER ','
         ILLNESS ':' BOOLEAN
        '}';


genotypes: GENOTYPES ':' '[' genotype_list ']';
genotype_list: genotype | genotype_list ',' genotype;
genotype: '{'
         SAMPLE ':' INTEGER ','
         MARKER ':' INTEGER ','
         ALLELE_1 ':' STRING ','
         ALLELE_2 ':' STRING
        '}';
%%

int main(int argc,char** argv)
{
/* yyparse() returns 0 when the input matches the grammar, non-zero otherwise */
return yyparse();
}

In this file, %token declares the tokens accepted by the grammar. The grammar %starts with the linkage rule: an opening brace, followed by a 'markers' rule, a comma, a 'samples' rule, a comma, a 'genotypes' rule, and a closing brace.
The 'markers' rule is the "markers" key, followed by a colon and a 'marker_list' enclosed in a pair of square brackets.
A marker_list is either a single marker, or a marker_list (recursive rule) followed by a comma and another marker.
A marker is a fixed sequence of JSON key/value pairs enclosed in braces. (Here, for simplicity, I expect all the fields to appear in a given order.)
The 'samples' and 'genotypes' rules follow the same pattern.

To convert this file into a C source (sample.tab.c) and a C header (sample.tab.h):

bison -d sample.y

sample.l: the Lexer

Here the lexer is basically an ordered set of regular expressions, each returning a token identifier (e.g. BOOLEAN, INTEGER, MARKERS, ...) for the text it matched in the input. Those identifiers are declared in the C header that bison generated from sample.y.

%{
#include <stdio.h>
#include "sample.tab.h" /* generated by bison -d */
%}
%%
"\"allele-1\""   return ALLELE_1;
"\"allele-2\""   return ALLELE_2;
"\"chrom\""      return CHROM;
"\"father-id\""  return FATHER_ID;
"\"mother-id\""  return MOTHER_ID;
"\"id\""         return ID;
"\"markers\""    return MARKERS;
"\"marker\""     return MARKER;
"\"illness\""    return ILLNESS;
"\"genotypes\""  return GENOTYPES;
"\"name\""       return NAME;
"\"position\""   return POSITION;
"\"samples\""    return SAMPLES;
"\"sample\""     return SAMPLE;
true             return BOOLEAN;
false            return BOOLEAN;
[0-9]+           return INTEGER;
\"[^\"]*\"       return STRING;/* a very simple string without escapes... */
[ \n\t\r]        ;/* ignore */
.                return yytext[0];
%%

To convert this file into a C source (lex.yy.c):

flex sample.l

Compilation

gcc -o validate sample.tab.c lex.yy.c

Testing

cat sample.json | ./validate
echo "Hello"| ./validate
Error:syntax error.

That's it

Pierre

2 comments:

Anonymous said...: It seems that a JSON schema is in the works now.

While XML has been utilized in multiple languages, JSON is made specifically for JavaScript. In fact, I found a validation tool written in JavaScript (no schemas yet, but still cool).

Do you know of any efforts to create a JSON standard for genomic data? I was hoping to use it for a research project I am doing this summer.; Friday, 19 June, 2009
Pierre Lindenbaum said...: @Evan,
no I think that most efforts will focus on RDF rather than on JSON.; Friday, 19 June, 2009
