Although I've never written a great new programming language, I have a little experience with C lexers/parsers, especially lex/yacc and flex/bison (yacc and bison generate LALR parsers). Now that I'm programming in Java, I've been looking at the parser generators available for this language: the most popular tools seem to be the top-down parsers javacc and antlr. In this post I show how I wrote a simple javacc parser reading a JSON entry (this is just an exercise; it would also be easy to write this kind of parser using java.io.StreamTokenizer or java.util.Scanner).
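For comparison, here is a minimal sketch of the java.io.StreamTokenizer route mentioned above. It only splits the input into tokens and does not build any object; the class name StreamTokenizerSketch and the method tokens are made up for this note. It relies on the tokenizer's default settings, which read every number as a double and treat both quote characters as string delimiters.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the javacc parser described in this post.
final class StreamTokenizerSketch {

    /** Returns a readable description of each token found in the input. */
    static List<String> tokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<String>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER:" + st.nval);       // numbers are always doubles
                } else if (type == StreamTokenizer.TT_WORD) {
                    out.add("WORD:" + st.sval);         // unquoted word such as true
                } else if (type == '"' || type == '\'') {
                    out.add("STRING:" + st.sval);       // quoted string, quotes removed
                } else {
                    out.add("CHAR:" + (char) type);     // punctuation such as [ ] ,
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail, but nextToken() declares IOException.
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("[1, \"two\", true]"));
    }
}
```

A real parser would still need a layer on top of this to check that brackets and commas appear in a valid order, which is the part javacc takes care of.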
First of all, I found the documentation very limited and, unlike Bison's, the javacc tutorial gave me the feeling of a kind of "Look at the examples, isn't it cool?". I'm just writing my notebook here, so on my side I won't explain how I think I've understood how javacc works :-).
OK, now let's go back to our JSON grammar and to the content of the javacc file.
The file is called JSONHandler.jj. It contains a Java class JSONHandler whose main reads the input from stdin and calls a method named object. This method parses the JSON stream and transforms it into a Java object, which main then echoes on stderr. The class is delimited by the keywords PARSER_BEGIN and PARSER_END:
PARSER_BEGIN(JSONHandler)
public class JSONHandler {
public static void main(String args[])
{
try
{
JSONHandler parser = new JSONHandler(System.in);
Object o=parser.object();
System.err.println("JAVA OBJECT: "+o);
} catch(Exception err)
{
err.printStackTrace();
}
}
}
PARSER_END(JSONHandler)
Next we declare that the blank characters will be ignored:
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
We then declare the lexical tokens (numbers, identifiers, quoted strings) as regular expressions written in javacc's notation. Names prefixed with # are private: they can only be used inside other token definitions. As an example, here, a SIMPLE_QUOTE_LITERAL starts and ends with "\'" and contains any number of "escaped C special characters" or "normal characters".
TOKEN : /* LITERALS */
{
<#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+ >
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?
| (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|"-")* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""] >
| <SIMPLE_QUOTE_LITERAL:
"\'"
( (~["\'","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\'"
>
|
<DOUBLE_QUOTE_LITERAL:
"\""
( (~["\"","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\""
>
}
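To get a feeling for what the number and identifier tokens accept, here is a rough translation of them into java.util.regex patterns. This is my own transcription, not something javacc produces, and the names TokenPatterns and classify are invented for the example. It tests whole strings, whereas the javacc lexer scans the input and picks the longest match, so take it only as a way to experiment with the definitions.

```java
import java.util.regex.Pattern;

// Hypothetical helper: java.util.regex transcriptions of three javacc tokens.
final class TokenPatterns {

    // (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?  |  (<DIGIT>)+ (<EXPONENT>)
    static final Pattern FLOATING_NUMBER = Pattern.compile(
            "[0-9]*\\.[0-9]*([Ee][-+]?[0-9]+)?|[0-9]+[Ee][-+]?[0-9]+");

    // (<DIGIT>)+
    static final Pattern INT_NUMBER = Pattern.compile("[0-9]+");

    // <LETTER> (<LETTER>|<DIGIT>|"-")*
    static final Pattern IDENTIFIER = Pattern.compile("[_a-zA-Z][_a-zA-Z0-9-]*");

    /** Names the first token definition that matches the whole string. */
    static String classify(String s) {
        if (FLOATING_NUMBER.matcher(s).matches()) return "FLOATING_NUMBER";
        if (INT_NUMBER.matcher(s).matches()) return "INT_NUMBER";
        if (IDENTIFIER.matcher(s).matches()) return "IDENTIFIER";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        String[] samples = {"3.14", "1e5", "10929", "organism-id", ".", "-5"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Two things show up when playing with it: a lone "." satisfies FLOATING_NUMBER because both digit groups are optional, and a number with a leading minus sign such as -5 is not covered by any of these tokens.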
As we saw, the parser starts its job by invoking the object method. We expect here a JSON array, a JSON object, or a simple value (null, false, true, a number or a string). Whichever choice matches stores its result in the variable o, which the method returns as a java.lang.Object.
public Object object():
{Object o;}
{
(
o=array()
| o= map()
| o= identifier()
)
{return o;}
}
JSON arrays are returned as java.util.Vector<Object>. A JSON array starts with "[" and ends with "]"; it contains zero or more JSON values separated by commas. Each time an element of the array is found, it is added to the vector. At the end, the vector is returned as the value of this array.
public Vector<Object> array():
{Vector<Object> vector= new Vector<Object>(); Object o;}
{
"[" ( o= object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} ) * )? "]"
{
return vector;
}
}
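To see how the "optional first element, then zero or more comma-prefixed elements" shape of this rule maps onto plain Java, here is a hand-written recursive-descent sketch. It is not the code that javacc generates; the class ArraySketch is hypothetical, it works on a list of tokens that has already been split, and it only accepts integers and nested arrays as elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written counterpart of the array() rule, for illustration only.
final class ArraySketch {
    private final List<String> tokens;
    private int pos = 0;

    ArraySketch(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Looks at the current token without consuming it (null at end of input). */
    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    /** Consumes the current token, which must equal the expected one. */
    private void expect(String expected) {
        if (!expected.equals(peek())) {
            throw new IllegalStateException("expected " + expected + " but found " + peek());
        }
        pos++;
    }

    /** object := array | integer (a simplification of the real rule). */
    Object object() {
        String t = peek();
        if (t == null) throw new IllegalStateException("unexpected end of input");
        if ("[".equals(t)) return array();
        pos++;
        return Long.valueOf(t);
    }

    /** array := "[" ( object ( "," object )* )? "]" */
    List<Object> array() {
        List<Object> list = new ArrayList<Object>();
        expect("[");
        if (!"]".equals(peek())) {          // the ( ... )? part
            list.add(object());
            while (",".equals(peek())) {    // the ( "," object )* part
                pos++;
                list.add(object());
            }
        }
        expect("]");
        return list;
    }
}
```

For instance, the token list [ 1 , [ 2 ] , 3 ] comes back as a list holding 1, a nested list holding 2, and 3.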
A JSON identifier can be a number, a bare string, a quoted string (which needs to be unescaped), null, true or false. The content of the lexical token is obtained via the Token object, whose class is generated by javacc.
public Object identifier():
{Token t;}
{
(
t=<FLOATING_NUMBER>
{
return new Double(t.image);
}
| t=<INT_NUMBER>
{
return new Long(t.image);
}
| t=<IDENTIFIER>
{
if(t.image.equals("true"))
{
return Boolean.TRUE;
}
else if(t.image.equals("false"))
{
return Boolean.FALSE;
}
else if(t.image.equals("null"))
{
return null;
}
return t.image;
}
| t=<SIMPLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
| t=<DOUBLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
)
}
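The unescape method called for the two quoted literals is defined in the complete listing at the end of this post. As a standalone sketch of the same idea, assuming the token image still carries its surrounding quotes and that a backslash is always followed by one of the characters allowed by ESCAPE_CHAR, it could look like this (the class name UnescapeSketch is made up):

```java
// Hypothetical standalone version of the unescape helper used by the parser.
final class UnescapeSketch {

    /** Strips the surrounding quotes and resolves C-style escape sequences. */
    static String unescape(String quoted) {
        StringBuilder sb = new StringBuilder(quoted.length());
        // Start at 1 and stop before the last index to skip the two quotes.
        for (int i = 1; i < quoted.length() - 1; ++i) {
            char c = quoted.charAt(i);
            if (c == '\\' && i + 1 < quoted.length() - 1) {
                c = quoted.charAt(++i);   // the character after the backslash
                switch (c) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case 't': sb.append('\t'); break;
                    case 'b': sb.append('\b'); break;
                    case 'f': sb.append('\f'); break;
                    default:  sb.append(c);   // covers \\ \' and \"
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("'it\\'s'"));
    }
}
```

The switch tests the letter that follows the backslash (n, r, t, ...), not the control character itself, since the token image contains the two-character sequence exactly as typed in the input.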
JSON objects are returned as java.util.HashMap. A JSON object starts with '{' and ends with '}'. Here, the map is passed as an argument to keyValue, which fills it each time the parser finds a key/value pair.
public HashMap<String,Object> map():
{HashMap<String,Object> map= new HashMap<String,Object>(); }
{
"{" ( keyValue(map) ("," keyValue(map))*)? "}"
{
return map;
}
}
public void keyValue( HashMap<String,Object> map):
{Object k; Object v;}
{
(k=identifier() ":" v=object())
{
if(k==null) throw new ParseException("null cannot be used as key in object");
if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot be used as key in object");
map.put(k.toString(),v);
}
}
Compilation:
javacc JSONHandler.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file JSONHandler.jj . . .
Parser generated successfully.
javac JSONHandler.java
Testing:
Here is the content of test.json:
{
organisms:[
{
id:10929,
name:"Bovine Rotavirus"
},
{
id:9606,
name:"Homo Sapiens"
}
],
proteins:[
{
label:"NSP3",
description:"Rotavirus Non Structural Protein 3",
organism-id: 10929,
acc: "ACB38353"
},
{
label:"EIF4G",
description:"eukaryotic translation initiation factor 4 gamma",
organism-id: 9606,
acc:"AAI40897"
}
],
interactions:[
{
label:"NSP3 interacts with EIF4G1",
pubmed-id:[77120248,38201627],
proteins:["ACB38353","AAI40897"]
}
]
}
This file is parsed, then a Java object is returned and printed on screen:
java JSONHandler < test.json
JAVA OBJECT: {organisms=[{id=10929, name=Bovine Rotavirus}, {id=9606, name=Homo Sapiens}],
proteins=[{description=Rotavirus Non Structural Protein 3, organism-id=10929, label=NSP3,
acc=ACB38353}, {description=eukaryotic translation initiation factor 4 gamma,
organism-id=9606, label=EIF4G, acc=AAI40897}], interactions=[{pubmed-id=[77120248, 38201627],
label=NSP3 interacts with EIF4G1, proteins=[ACB38353, AAI40897]}]}
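The returned value is a plain tree of maps and lists, so it can be walked with instanceof checks and casts. The sketch below rebuilds by hand a small part of the structure printed above (the organisms only) and collects the names; NavigateResult, sample and organismNames are hypothetical names, and sample() stands in for a real call to parser.object().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Vector;

// Hypothetical example of walking the Object returned by the parser.
final class NavigateResult {

    /** Collects the "name" of every entry found under the "organisms" key. */
    static List<String> organismNames(Object root) {
        List<String> names = new ArrayList<String>();
        if (!(root instanceof Map)) return names;
        Object organisms = ((Map<?, ?>) root).get("organisms");
        if (!(organisms instanceof List)) return names;   // Vector is a List
        for (Object item : (List<?>) organisms) {
            if (item instanceof Map) {
                Object name = ((Map<?, ?>) item).get("name");
                if (name != null) names.add(name.toString());
            }
        }
        return names;
    }

    /** Hand-built stand-in for part of what the parser returns for test.json. */
    static Object sample() {
        Vector<Object> organisms = new Vector<Object>();
        organisms.add(organism(10929L, "Bovine Rotavirus"));
        organisms.add(organism(9606L, "Homo Sapiens"));
        HashMap<String, Object> root = new HashMap<String, Object>();
        root.put("organisms", organisms);
        return root;
    }

    private static HashMap<String, Object> organism(long id, String name) {
        HashMap<String, Object> m = new HashMap<String, Object>();
        m.put("id", Long.valueOf(id));
        m.put("name", name);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(organismNames(sample()));
    }
}
```

Note that HashMap does not keep insertion order, which is why the keys in the printed output do not follow the order of the input file; a java.util.LinkedHashMap would preserve it.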
Here is the complete javacc source code:
PARSER_BEGIN(JSONHandler)
import java.util.Vector;
import java.util.HashMap;
public class JSONHandler {
public static void main(String args[])
{
try
{
JSONHandler parser = new JSONHandler(System.in);
Object o=parser.object();
System.err.println("JAVA OBJECT: "+o);
} catch(Exception err)
{
err.printStackTrace();
}
}
/** unescape a C string */
private static String unescape(String s)
{
StringBuilder sb= new StringBuilder(s.length());
for(int i=1;i< s.length()-1;++i)
{
if(s.charAt(i)=='\\')
{
if(i+1< s.length()-1)
{
++i;
switch(s.charAt(i))
{
case 'n': sb.append('\n'); break;
case 'r': sb.append('\r'); break;
case '\\': sb.append('\\'); break;
case 'b': sb.append('\b'); break;
case 't': sb.append('\t'); break;
case 'f': sb.append('\f'); break;
case '\'': sb.append('\''); break;
case '\"': sb.append('\"'); break;
default: sb.append(s.charAt(i));
}
}
}
else
{
sb.append(s.charAt(i));
}
}
return sb.toString();
}
}
PARSER_END(JSONHandler)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
TOKEN : /* LITERALS */
{
<#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+ >
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?
| (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|"-")* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""] >
| <SIMPLE_QUOTE_LITERAL:
"\'"
( (~["\'","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\'"
>
|
<DOUBLE_QUOTE_LITERAL:
"\""
( (~["\"","\\","\n","\r"])
| <ESCAPE_CHAR>
)*
"\""
>
}
public Object object():
{Object o;}
{
(
o=array()
| o= map()
| o= identifier()
)
{return o;}
}
public Object identifier():
{Token t;}
{
(
t=<FLOATING_NUMBER>
{
return new Double(t.image);
}
| t=<INT_NUMBER>
{
return new Long(t.image);
}
| t=<IDENTIFIER>
{
if(t.image.equals("true"))
{
return Boolean.TRUE;
}
else if(t.image.equals("false"))
{
return Boolean.FALSE;
}
else if(t.image.equals("null"))
{
return null;
}
return t.image;
}
| t=<SIMPLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
| t=<DOUBLE_QUOTE_LITERAL>
{
return unescape(t.image);
}
)
}
public Vector<Object> array():
{Vector<Object> vector= new Vector<Object>(); Object o;}
{
"[" ( o=object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} ) * )? "]"
{
return vector;
}
}
public HashMap<String,Object> map():
{HashMap<String,Object> map= new HashMap<String,Object>(); }
{
"{" ( keyValue(map) ("," keyValue(map))*)? "}"
{
return map;
}
}
public void keyValue( HashMap<String,Object> map):
{Object k; Object v;}
{
(k=identifier() ":" v=object())
{
if(k==null) throw new ParseException("null cannot be used as key in object");
if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot be used as key in object");
map.put(k.toString(),v);
}
}
That's it. What's next?
jjtree is another component of the javacc package and seems to be a promising tool: it builds a tree structure from the parsed input, following the grammar. The nodes of this tree can then be visited, just like in a DOM/XML document, and a language can be implemented on top of it; but here again, the documentation is succinct.