01 July 2008

Parsing JSON with javacc: my notebook.

Although I have never written a great new programming language, I have had a little experience with C lexers/parsers, especially lex/yacc and flex/bison, which generate LALR parsers. Now that I'm programming in Java, I've been looking at the parsers available for this language: the most popular tools seem to be the top-down parser generators javacc and antlr. In this post I show how I wrote a simple javacc parser reading a JSON entry (this is just an exercise; it is also easy to write this kind of parser using java.io.StreamTokenizer or java.util.Scanner).
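As a rough illustration of the java.io.StreamTokenizer alternative mentioned above, here is a minimal sketch that only splits a JSON-like string into tokens; it does not build any object. The class name StreamTokenizerDemo and the "KIND:value" output format are made up for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    /** Splits a JSON-like string into a flat list of "KIND:value" descriptions. */
    static List<String> tokenize(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> tokens = new ArrayList<String>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        // StreamTokenizer reports every number as a double
                        tokens.add("NUMBER:" + st.nval);
                        break;
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD:" + st.sval);
                        break;
                    case '"':
                    case '\'':
                        // for quoted strings, ttype is the quote character
                        tokens.add("STRING:" + st.sval);
                        break;
                    default:
                        // punctuation such as { } [ ] : ,
                        tokens.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("{id:9606, name:\"Homo Sapiens\"}")) {
            System.out.println(t);
        }
    }
}
```

A parser built this way would still need hand-written code to check the nesting of braces and brackets, which is the part javacc generates from the grammar.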
First of all, I found the documentation very limited and, unlike with Bison, I had the feeling that the javacc tutorial was a kind of "Look at the examples, isn't it cool?". I'm just writing my notebook here, so for my part I won't try to explain how I think javacc works :-).
OK, now let's go back to our JSON grammar and to the content of the javacc file.

The file is called JSONHandler.jj. It contains a Java class JSONHandler whose main creates a parser reading from stdin and calls its object method. This method parses the JSON stream and transforms it into a Java object, which main then echoes on stderr. The class is delimited by the keywords PARSER_BEGIN and PARSER_END:


PARSER_BEGIN(JSONHandler)
public class JSONHandler {
    public static void main(String args[])
    {
        try
        {
            JSONHandler parser = new JSONHandler(System.in);
            Object o=parser.object();
            System.err.println("JAVA OBJECT: "+o);
        } catch(Exception err)
        {
            err.printStackTrace();
        }
    }
}
PARSER_END(JSONHandler)


Next we declare that the blank characters will be ignored:
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}



We then declare the lexical tokens (numbers, strings, quoted strings) using javacc's regular-expression syntax. Names prefixed with # are private helpers that can only be used inside other token definitions. As an example, here a SIMPLE_QUOTE_LITERAL starts and ends with "\'" and contains any number of escaped C special characters or normal characters.

TOKEN : /* LITERALS */
{
  <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+ >
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?
                  | (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|"-")* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""] >
| <SIMPLE_QUOTE_LITERAL:
    "\'"
    ( (~["\'","\\","\n","\r"])
    | <ESCAPE_CHAR>
    )*
    "\'"
  >
| <DOUBLE_QUOTE_LITERAL:
    "\""
    ( (~["\"","\\","\n","\r"])
    | <ESCAPE_CHAR>
    )*
    "\""
  >
}
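To get a feeling for what the SIMPLE_QUOTE_LITERAL token accepts, the same rule can be approximated with java.util.regex. This is only a sketch for experimenting outside javacc; the class name QuotedLiteralDemo and its helper method are made up for this example and are not part of the grammar.

```java
import java.util.regex.Pattern;

class QuotedLiteralDemo {
    // Rough equivalent of SIMPLE_QUOTE_LITERAL: a quote, then any number of
    // (normal character | escape sequence), then a closing quote.
    // Normal character: anything except ' \ newline and carriage return.
    // Escape sequence: a backslash followed by one of n t b r f \ ' "
    static final Pattern SIMPLE_QUOTE = Pattern.compile(
        "'(?:[^'\\\\\\n\\r]|\\\\[ntbrf\\\\'\"])*'");

    static boolean isSimpleQuoteLiteral(String s) {
        return SIMPLE_QUOTE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleQuoteLiteral("'Homo Sapiens'")); // plain text: true
        System.out.println(isSimpleQuoteLiteral("'it\\'s'"));       // escaped quote: true
        System.out.println(isSimpleQuoteLiteral("'bad \\q'"));      // unknown escape: false
        System.out.println(isSimpleQuoteLiteral("'open"));          // no closing quote: false
    }
}
```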


As we saw, the parser starts its job by invoking the object method. Here we expect a JSON array, a JSON object or a simple value (null, false, true, a number or a string). Each choice stores its result in the variable o, which the method returns as a java.lang.Object.

public Object object():
{Object o;}
{
    (
      o=array()
    | o= map()
    | o= identifier()
    )
    {return o;}
}


JSON arrays will be returned as java.util.Vector<Object>. A JSON array starts with "[" and ends with "]"; it contains zero or more JSON values separated by commas. Each time an element of the array is found, it is added to the vector. At the end, the vector is returned as the value of this array.
public Vector<Object> array():
{Vector<Object> vector= new Vector<Object>(); Object o;}
{
    "[" ( o=object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} )* )? "]"
    {
        return vector;
    }
}
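For readers who have never used a top-down parser, the array rule can be pictured as a hand-written recursive-descent method. The sketch below is not the code javacc generates; it works on a pre-split list of string tokens, only knows integers and nested arrays, and the class name ArrayRuleDemo and its helpers are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

class ArrayRuleDemo {
    private final List<String> tokens;
    private int pos = 0;

    ArrayRuleDemo(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    /** Returns the current token without consuming it, or null at end of input. */
    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    /** Consumes the current token, which must be the expected one. */
    private void expect(String expected) {
        if (!expected.equals(peek())) {
            throw new IllegalStateException("expected " + expected + " but found " + peek());
        }
        pos++;
    }

    // object := array | INT_NUMBER   (maps and strings are left out of this sketch)
    Object object() {
        if ("[".equals(peek())) return array();
        String t = peek();
        if (t == null) throw new IllegalStateException("unexpected end of input");
        pos++;
        return Long.valueOf(t);
    }

    // array := "[" ( object ( "," object )* )? "]"
    Vector<Object> array() {
        Vector<Object> vector = new Vector<Object>();
        expect("[");
        if (!"]".equals(peek())) {
            vector.addElement(object());
            while (",".equals(peek())) {
                pos++; // consume the comma
                vector.addElement(object());
            }
        }
        expect("]");
        return vector;
    }

    public static void main(String[] args) {
        ArrayRuleDemo p = new ArrayRuleDemo("[", "1", ",", "[", "2", "]", ",", "3", "]");
        System.out.println(p.array()); // prints [1, [2], 3]
    }
}
```

The optional group and the repeated "," object() part of the grammar map directly onto the if and the while loop.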


A JSON identifier can be a number, a string, a quoted string (which needs to be unescaped), null, true or false. The content of the lexical token is obtained via the Token object, whose class is generated by javacc.

public Object identifier():
{Token t;}
{
    (
      t=<FLOATING_NUMBER>
        {
            return Double.valueOf(t.image);
        }
    | t=<INT_NUMBER>
        {
            return Long.valueOf(t.image);
        }
    | t=<IDENTIFIER>
        {
            if(t.image.equals("true"))
            {
                return Boolean.TRUE;
            }
            else if(t.image.equals("false"))
            {
                return Boolean.FALSE;
            }
            else if(t.image.equals("null"))
            {
                return null;
            }
            return t.image;
        }
    | t=<SIMPLE_QUOTE_LITERAL>
        {
            return unescape(t.image);
        }
    | t=<DOUBLE_QUOTE_LITERAL>
        {
            return unescape(t.image);
        }
    )
}
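The unescape helper called above lives in the parser class and is shown in the complete listing at the end of this post. As a condensed, standalone sketch of the same idea (drop the surrounding quotes, then resolve the backslash escapes allowed by ESCAPE_CHAR), assuming the lexer only hands over well-formed token images; the class name UnescapeDemo is made up for this example:

```java
class UnescapeDemo {
    /** Strips the surrounding quotes of a token image and resolves backslash escapes. */
    static String unescape(String image) {
        StringBuilder sb = new StringBuilder(image.length());
        // index 0 is the opening quote and the last index is the closing quote
        for (int i = 1; i < image.length() - 1; ++i) {
            char c = image.charAt(i);
            if (c == '\\' && i + 1 < image.length() - 1) {
                c = image.charAt(++i); // the character that follows the backslash
                switch (c) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case 't': sb.append('\t'); break;
                    case 'b': sb.append('\b'); break;
                    case 'f': sb.append('\f'); break;
                    default: sb.append(c); // covers \\ \' and \"
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("'it\\'s'"));              // prints it's
        System.out.println(unescape("\"say \\\"hello\\\"\"")); // prints say "hello"
    }
}
```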


A JSON object will be returned as a java.util.HashMap. A JSON object starts with '{' and ends with '}'. Here the map is passed as an argument to keyValue, which fills it each time the parser finds a key/value pair.

public HashMap<String,Object> map():
{HashMap<String,Object> map= new HashMap<String,Object>(); }
{
    "{" ( keyValue(map) ("," keyValue(map))*)? "}"
    {
        return map;
    }
}

public void keyValue( HashMap<String,Object> map):
{Object k; Object v;}
{
    (k=identifier() ":" v=object())
    {
        if(k==null) throw new ParseException("null cannot be used as key in object");
        if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot be used as key in object");
        map.put(k.toString(),v);
    }
}


Compilation:
javacc JSONHandler.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file JSONHandler.jj . . .
Parser generated successfully.
javac JSONHandler.java


Testing:
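Here is the content of test.json (note that the keys are unquoted identifiers, which this grammar accepts although strict JSON requires quoted keys):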
{
organisms:[
{
id:10929,
name:"Bovine Rotavirus"
},
{
id:9606,
name:"Homo Sapiens"
}
],
proteins:[
{
label:"NSP3",
description:"Rotavirus Non Structural Protein 3",
organism-id: 10929,
acc: "ACB38353"
},
{
label:"EIF4G",
description:"eukaryotic translation initiation factor 4 gamma",
organism-id: 9606,
acc:"AAI40897"
}
],
interactions:[
{
label:"NSP3 interacts with EIF4G1",
pubmed-id:[77120248,38201627],
proteins:["ACB38353","AAI40897"]
}
]
}


The file is parsed, and a Java object is returned and printed on the screen:
java JSONHandler < test.json
JAVA OBJECT: {organisms=[{id=10929, name=Bovine Rotavirus}, {id=9606, name=Homo Sapiens}],
proteins=[{description=Rotavirus Non Structural Protein 3, organism-id=10929, label=NSP3,
acc=ACB38353}, {description=eukaryotic translation initiation factor 4 gamma,
organism-id=9606, label=EIF4G, acc=AAI40897}], interactions=[{pubmed-id=[77120248, 38201627],
label=NSP3 interacts with EIF4G1, proteins=[ACB38353, AAI40897]}]}
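Since the result is made of plain HashMap, Vector, Long, Double, Boolean and String instances, it can be walked with ordinary casts. The sketch below builds by hand a small stand-in for what parser.object() returns for the organisms part of test.json (so that it runs without the generated parser) and extracts the organism names. The class name WalkResultDemo and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Vector;

class WalkResultDemo {
    /** Collects the "name" of every entry found under the "organisms" key. */
    static List<String> organismNames(Object root) {
        List<String> names = new ArrayList<String>();
        Map<?, ?> top = (Map<?, ?>) root;                    // a JSON object is a HashMap
        List<?> organisms = (List<?>) top.get("organisms");  // a JSON array is a Vector
        for (Object item : organisms) {
            Map<?, ?> organism = (Map<?, ?>) item;
            names.add((String) organism.get("name"));
        }
        return names;
    }

    /** Hand-built stand-in for the parsed "organisms" part of test.json. */
    static Object sample() {
        Vector<Object> organisms = new Vector<Object>();
        organisms.addElement(organism(10929L, "Bovine Rotavirus"));
        organisms.addElement(organism(9606L, "Homo Sapiens"));
        HashMap<String, Object> root = new HashMap<String, Object>();
        root.put("organisms", organisms);
        return root;
    }

    private static HashMap<String, Object> organism(long id, String name) {
        HashMap<String, Object> m = new HashMap<String, Object>();
        m.put("id", Long.valueOf(id)); // INT_NUMBER tokens become Long
        m.put("name", name);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(organismNames(sample())); // prints [Bovine Rotavirus, Homo Sapiens]
    }
}
```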


Here is the complete javacc source code:
PARSER_BEGIN(JSONHandler)
import java.util.Vector;
import java.util.HashMap;

public class JSONHandler {



public static void main(String args[])
{
try
{
JSONHandler parser = new JSONHandler(System.in);
Object o=parser.object();
System.err.println("JAVA OBJECT: "+o);
} catch(Exception err)
{
err.printStackTrace();
}
}

/** unescape a C string */
private static String unescape(String s)
{
StringBuilder sb= new StringBuilder(s.length());
for(int i=1;i< s.length()-1;++i)
{
if(s.charAt(i)=='\\')
{
if(i+1< s.length()-1)
{
++i;
switch(s.charAt(i))
{
case 'n': sb.append('\n'); break;
case 'r': sb.append('\r'); break;
case '\\': sb.append('\\'); break;
case 'b': sb.append('\b'); break;
case 't': sb.append('\t'); break;
case 'f': sb.append('\f'); break;
case '\'': sb.append('\''); break;
case '\"': sb.append('\"'); break;
default: sb.append(s.charAt(i));
}
}
}
else
{
sb.append(s.charAt(i));
}
}
return sb.toString();
}

}

PARSER_END(JSONHandler)

SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}


TOKEN : /* LITERALS */
{
<#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+ >
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?
| (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|"-")* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""] >
| <SIMPLE_QUOTE_LITERAL:
"\'"
( (~["\'","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\'"
>
|
<DOUBLE_QUOTE_LITERAL:
"\""
( (~["\"","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\""
>
}



public Object object():
{Object o;}
{
(
o=array()
| o= map()
| o= identifier()
)
{return o;}
}

public Object identifier():
{Token t;}
{
(
t=<FLOATING_NUMBER>
{
return Double.valueOf(t.image);
}
| t=<INT_NUMBER>
{
return Long.valueOf(t.image);
}
| t=<IDENTIFIER>
{
if(t.image.equals("true"))
{
return Boolean.TRUE;
}
else if(t.image.equals("false"))
{
return Boolean.FALSE;
}
else if(t.image.equals("null"))
{
return null;
}
return t.image;
}
| t=<SIMPLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
| t=<DOUBLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
)
}

public Vector<Object> array():
{Vector<Object> vector= new Vector<Object>(); Object o;}
{
"[" ( o=object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} ) * )? "]"
{
return vector;
}
}

public HashMap<String,Object> map():
{HashMap<String,Object> map= new HashMap<String,Object>(); }
{
"{" ( keyValue(map) ("," keyValue(map))*)? "}"
{
return map;
}
}

public void keyValue( HashMap<String,Object> map):
{Object k; Object v;}
{
(k=identifier() ":" v=object())
{
if(k==null) throw new ParseException("null cannot be used as key in object");
if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot be used as key in object");
map.put(k.toString(),v);
}
}



That's it. What's next? jjtree is another component of the javacc package and seems to be a promising tool: it builds a tree structure from the grammar. The nodes of this tree can then be visited just like in a DOM/XML document, and a language can be implemented on top of it, but here again the documentation is succinct.

9 comments:

  1. Pierre, you can better use List<Object> and Map<Object> as return values, being interfaces instead of implementations, and if you don't need threading in the parsing, ArrayList is faster than Vector.

  2. Dear Pierre,

    I found your post and the information very useful. I was writing a small JSON lib that I would like to publish, and I would like to include a (slightly modified) version of your JavaCC grammar as a parser. So I would like to know: may I reuse your grammar? If so, under which license?

  3. Hi Max,
    Feel free to use this code in any way you want !:-) Please, just add a reference to me (Pierre Lindenbaum plindenbaum yahoo fr ) and to this post in the source and/or the README.

  4. Pierre,

    I've published the library at:

    http://max.berger.name/oss/mjl

    and reference you in the source file and as contributor. Thank you very much for your post!

    Max

  5. No problem Max, thank you for using this small code.
    Pierre

  6. i have a similar grammar setup and i'm having some issues. do you mind helping me out? all of my source code is at:

    https://sourceforge.net/projects/javajson/

    and i've created a fairly detailed bug with test case here:

    https://sourceforge.net/tracker/?func=detail&aid=2733371&group_id=162311&atid=823285

    the test case is actually committed into the test folder.

  7. max, i've implemented a json library. you should take a look at it before spending a lot of time on it. maybe you can help me out with it.

    https://sourceforge.net/projects/javajson/

  8. Very useful example for using javacc in real life.

    Thank you a lot, and hoping to see more about this topic.
