YOKOFAKUN: javacc

Showing posts with label javacc. Show all posts

30 March 2011

Parsing a genomic position with javacc

Parsing a genomic position (chrom:start-end) is an easy task but I've always been too lazy to create a library for this. Today I wrote a Java-CC-based parser for analyzing the various syntaxes of a genomic position. Here is the grammar I used:

COMMA: ","
LETTER: (["a"-"z"]|["A"-"Z"]|"_") ;
DIGIT: ["0"-"9"];
INT:<DIGIT> ( (<DIGIT>|<COMMA>)* <DIGIT>)? ;
BP: "b" ("p")? ;
KB: ("k") ("B")? ;
MB: ("m") ("B")? ;
GB: ("g") ("B")? ;
IDENTIFIER: <LETTER> (<DIGIT>|<LETTER>)* ;
COLON: ":" ;
DASH: "-" ;
PLUS: "+" ;
DELIM: ("|"|";") ;

java.util.List<Segment> many(): segment() ((<DELIM>)? segment() )* )? <EOF>);
Segment one(): segment() <EOF>;
Segment segment(): chromName() <COLON> position() (<DASH> position()| <PLUS> position())? );
BigInteger position():integer() (factor())?;
BigInteger factor(): ( <BP> | <KB>| <MB> | <GB> );
BigInteger integer():<INT> ;
String chromName():( integer() | identifier());
String identifier(): <IDENTIFIER> ;

Source code

Compiling

javacc SegmentParser.jj
javac SegmentParser.java

Running

echo " chrM:1-100,000"| java SegmentParser
chrM:1-100000
echo " c1:1000"| java SegmentParser
c1:1000-1001
echo "2:1Gb+1 " | java SegmentParser
chr2:999999999-1000000002
echo "chr2:10+100" | java SegmentParser
ParseException: -90 < 0)
echo "chrX:3147483647" | java SegmentParser
ParseException: 3147483647 > 2147483647 (int-max)
echo "2:1Gb+a azd " | java SegmentParser
ParseException: Encountered "a" at line 1, column 7

That's it,

Pierre

09 November 2009

Building a simple Expression language with JJTree/Javacc . My Notebook

Last year, I described how to use the JAVACC a parser/scanner generator for java. This WE, I've played with JJTREE: JJTree is a preprocessor for JavaCC that inserts parse tree building actions at various places in the JavaCC source.. Here I describe how to build a simple expression language to find an object in a simple 'JSON 'object only built with arrays, java.util.List, java.util.Map, String, etc... . For example: exectuting a[5].year on {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]} would return: 2009

The JSONPath.jjt header

options {
/** create a pure parser */
STATIC=false;
DEBUG_PARSER=false;
ERROR_REPORTING=true;
NODE_USES_PARSER=false;
/** this class will build the nodes, must implement:"static SimpleNode JSONPathCompiler.jjtCreate(int id)" */
NODE_FACTORY="JSONPathCompiler";
/** The node of the Abstract Syntax Tree will extend this class */
NODE_CLASS="JSONNode";
}

The tokens

Defines the regular expressions to be recognized by the lexer.

TOKEN: {
<#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOT: ".">
| <INTEGER: <DIGIT> (<DIGIT>)* >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)*>
| <OPEN_ARRAY: "[">
| <CLOSE_ARRAY: "]">
}

The grammar

An expression is at least an 'object' or an 'array' followed by any number of 'objects' or 'arrays'. '#EXPRESSION' means that we want this node to be named 'JJTEXPRESSION' instead of the default, 'JJTexpr'. This expression with return a JSONNode.

JSONNode expr() #EXPRESSION:{}
{
(array() | object() ) ( array() | <DOT> object() )* <EOF>
{
/** jjtThis is the current AST node */
return jjtThis;
}
}

An array is an integer surrounded by two brackets. The AST node created by Javacc contains an object called 'value' holding whatever your want (here, the value of the index ).

void array() #ARRAY:{Token i;}
{
<OPEN_ARRAY> i=<INTEGER> <CLOSE_ARRAY>
{
jjtThis.value=new Integer(i.image);
}
}

An object/map is just a name.

void object() #OBJECT: { Token name;}
{
( name=<IDENTIFIER> )
{
jjtThis.value= name.image;
}
}

The Custom AST node

By default, all the node are an instance of the class SimpleNode generated by JJTree. Here, I've choosen to use a custom node extending SimpleNode.

public static class JSONNode
extends SimpleNode
{
JSONNode(int i)
{
super(i);
}
(...)
}

static SimpleNode jjtCreate(int id)
{
return new JSONNode(id);
}
(...)

Implementing the expression language

The method 'eval' will implement the expression language itself. For example, a node of type "JJTEXPRESSION:" will get the values of its children, a node of type "JJTARRAY" will test if the object is a list or an array and will use the index stored in value to retrieve the indexed data.

public Object eval(Object o)
{
if(o==null) return null;
switch(this.id)
{
case JJTEXPRESSION:
{
if(getChildrenCount()==0) return null;
for(int i=0;o!=null && i< getChildrenCount();++i)
{
o=at(i).eval(o);
}
return o;
}
case JJTARRAY:
{
int index= Integer.class.cast(this.value);
if(o instanceof java.util.List)
{
java.util.List L=java.util.List.class.cast(o);
if(index<0 || index>= L.size()) return null;
return L.get(index);
}
else if(o.getClass().isArray())
{
Object L[]=(Object[])o;
if(index<0 || index>= L.length) return null;
return L[index];
}
return null;
}
case JJTOBJECT:
{
if(o instanceof java.util.Map)
{
java.util.Map M=java.util.Map.class.cast(o);
return M.get(this.value);
}
return null;
}

default:return null;
}
}

Compiling Testing

$(JCC)/jjtree JSONPath.jjt
$(JCC)/javacc JSONPath.jj
javac JSONPathCompiler.java
java JSONPathCompiler <expression>

Results

Compiling a[5]
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:{name=Pierre, year=2009}

Compiling a[1]
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:World

Compiling a[1][3][0]
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:null

Compiling d
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:12345

Compiling a[5].year
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:2009

Compiling x.y
Eval {"d":12345, "b":null, "c":false, "a":["Hello", "World", true, false, null, {"name":"Pierre", "year":2009}]}
Result is:null

Full source code

options {
STATIC=false;
DEBUG_PARSER=false;
ERROR_REPORTING=true;
NODE_USES_PARSER=false;
NODE_FACTORY="JSONPathCompiler";
NODE_CLASS="JSONNode";
}

PARSER_BEGIN(JSONPathCompiler)
import java.lang.reflect.*;

public class JSONPathCompiler
{
public static class JSONNode
extends SimpleNode
{
JSONNode(int i)
{
super(i);
}

public int getChildrenCount()
{
return this.children==null?0:children.length;
}

public JSONNode at(int index)
{
return (JSONNode)(
this.children==null ||
index<0 ||
index>=this.children.length ?
null:
this.children[index]
);
}

public String getName()
{
return JSONPathCompilerTreeConstants.jjtNodeName[this.id];
}

public void dump(String prefix)
{
System.out.println(prefix+toString());
if( children!=null && children.length>0)
{
System.out.println(prefix+" [");
for (int i = 0; children!=null && i < children.length; ++i)
{
System.out.print(prefix+" ("+(i+1)+")");
SimpleNode c= (SimpleNode)children[i];
if(c==null)
{
System.out.println("null");
}
else
{
c.dump(prefix+" ");
}
}
System.out.println(prefix+" ]");
}
}

public Object eval(Object o)
{
if(o==null) return null;
switch(this.id)
{
case JJTEXPRESSION:
{
if(getChildrenCount()==0) return null;
for(int i=0;o!=null && i< getChildrenCount();++i)
{
o=at(i).eval(o);
}
return o;
}
case JJTARRAY:
{
int index= Integer.class.cast(this.value);
if(o instanceof java.util.List)
{
java.util.List L=java.util.List.class.cast(o);
if(index<0 || index>= L.size()) return null;
return L.get(index);
}
else if(o.getClass().isArray())
{
Object L[]=(Object[])o;
if(index<0 || index>= L.length) return null;
return L[index];
}
return null;
}
case JJTOBJECT:
{
if(o instanceof java.util.Map)
{
java.util.Map M=java.util.Map.class.cast(o);
return M.get(this.value);
}
return null;
}

default:
System.err.println("Doesn't handle "+this.id);
break;
}
return null;
}
}

static SimpleNode jjtCreate(int id)
{
return new JSONNode(id);
}

public static void main(String args[])
{
try
{
java.io.StringReader r= new java.io.StringReader(args[0]);
JSONPathCompiler app= new JSONPathCompiler(r);
JSONNode x=JSONNode.class.cast(app.expr());
//x.dump("[xpath]");
java.util.Map<String,Object> o1=new java.util.HashMap<String,Object>();
java.util.HashMap o2=new java.util.HashMap<String,Object>();
java.util.List<Object> o3=new java.util.ArrayList<Object>();

o1.put("a",o3);
o1.put("b",null);
o1.put("c",Boolean.FALSE);
o1.put("d",12345);

o3.add("Hello");
o3.add("World");
o3.add(Boolean.TRUE);
o3.add(Boolean.FALSE);
o3.add(null);
o3.add(o2);

o2.put("name","Pierre");
o2.put("year",2009);

Object o=x.eval(o1);
System.err.println("Compiling <code>"+args[0]+"</code>");
System.err.println("Eval <code>"+o1+"</code>");
System.err.println("Result is:<code>"+o+"</code>");
}
catch(Exception err)
{
err.printStackTrace();
}
}
}
PARSER_END(JSONPathCompiler)

SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}

TOKEN: {
<#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <DOT: ".">
| <INTEGER: <DIGIT> (<DIGIT>)* >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)*>
| <OPEN_ARRAY: "[">
| <CLOSE_ARRAY: "]">
}

JSONNode expr() #EXPRESSION:{}
{
(array() | object() ) ( array() | <DOT> object() )* <EOF>
{
return jjtThis;
}
}

void array() #ARRAY:{Token i;}
{
<OPEN_ARRAY> i=<INTEGER> <CLOSE_ARRAY>
{
jjtThis.value=new Integer(i.image);
}
}

void object() #OBJECT: { Token name;}
{
( name=<IDENTIFIER> )
{
jjtThis.value= name.image;
}
}

That's it.
Pierre

01 July 2008

Parsing JSON with javacc my Notebook.

Although I didn't wrote a new great programming language, I had a little experience with C lexers/parsers especialy with lex/yacc|flex/bison , a LALR parser. Now I'm programming in Java, I've been looking for the parsers available for this language: the most popular tools seems to be the top-down parsers javacc and antlr. In this post I show how I wrote a simple javacc parser reading a JSON entry (this is just an exercice, it is also easy to write this kind of parser using java.io.StreamTokenizer or java.util.Scanner).
First of all I found that the documentation was very limited and unlike Bison, I had the feeling that the javacc tutorial was a kind of "Look a the examples, isn't it cool ?"? I just writing my notebook here, so on my side I won't explain how I think I've understand how javacc works :-).
OK now, let's go back, to our JSON grammar and to the content of the javacc file.

The file is called JSONHandler.jj, it contains a java class JSONHandler with a main calling a method object after reading the input from stdin. This method will parse the json stream , transform it into a java object and echo it on stderr. The class is delimited by the keywords PARSER_BEGIN and PARSER_END

PARSER_BEGIN(JSONHandler)
public class JSONHandler {
  public static void main(String args[])
   {
   try
   {
    JSONHandler parser = new JSONHandler(System.in);
    Object o=parser.object();
    System.err.println("JAVA OBJECT: "+o);
   } catch(Exception err)
   {
   err.printStackTrace();
   }
   }
}
PARSER_END(JSONHandler)

Next we declare that the blank characters will be ignored:

SKIP :
{
  " "
| "\t"
| "\n"
| "\r"
}

We then declare the lexical tokens (numbers, string, quoted strings) using a BNF grammar. As an example, here, a SIMPLE_QUOTE_LITERAL starts and ends with "\'" , it contains an unlimited number of ("escaped C special characters" or "normal characters").

TOKEN  : /* LITERALS */
{
  <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+
>
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPO
NENT>)?
        | (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|<->)* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""]
 >
| <SIMPLE_QUOTE_LITERAL:
  "\'"
      (   (~["\'","\\","\n","\r"])
        | <ESCAPE_CHAR>
      )*
      "\'"
  >
|
  <DOUBLE_QUOTE_LITERAL:
      "\""
      (   (~["\"","\\","\n","\r"])
       | <ESCAPE_CHAR>
      )*
      "\""
  >
}

As we saw, the parser starts is job by invoking the object method. We expect here a JSON array or a JSON object or another identifier (null ||false||true||string). Those choices store their results in the variable o. This method returns a java.lang.Object.

public Object object():
        {Object o;}
        {
        (
          o=array()
        | o= map()
        | o= identifier()
        )
        {return o;}
        }

JSON arrays will be returned as java.util.Vector<Object>. A JSON array is identified as starting by "[" and ending with "]", it contains zero or more JSON objects separated with a comma. Each time an element of this array is found, it is added in the vector. at the end, the vector is returned as the value of this array.

public Vector<Object>     array():
        {Vector<Object> vector= new Vector<Object>(); Object o;}
        {
        "[" ( o= object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} ) * )? "]"
                {
                return vector;
                }
        }

A JSON identifier can be a number, a string, a quoted string (need to be un-escaped), null, true or false. The content of this lexical token is obtained via the Token object which class is generated by javacc.

public Object identifier():
        {Token t;}
        {
        (
        t=<FLOATING_NUMBER>
                {
                return new Double(t.image);
                }
        | t=<INT_NUMBER>
                {
                return new Long(t.image);
                }
        | t=<IDENTIFIER>
                {
                if(t.image.equals("true"))
                        {
                        return Boolean.TRUE;
                        }
                else if(t.image.equals("false"))
                        {
                        return Boolean.FALSE;
                        }
                else if(t.image.equals("null"))
                        {
                        return null;
                        }
                return t.image;
                }
        | t=<SIMPLE_QUOTE_LITERAL>
                {
                return unescape(t.image);
                }
        | t=<DOUBLE_QUOTE_LITERAL>
                {
                return unescape(t.image);
                }
        )
        }

JSON Object will be returned as java.util.HashMap. A JSON Object starts with '{' and ends with '}'. In this case, we're passing this map as an argument each time the parser finds a pair of key/value.

public HashMap<String,Object>     map():
        {HashMap<String,Object> map= new HashMap<String,Object>(); }
        {
        "{" ( keyValue(map) ("," keyValue(map))*)? "}"
                {
                return map;
                }
        }

public  void keyValue( HashMap<String,Object> map):
        {Object k; Object v;}
        {
        (k=identifier() ":" v=object())
                {
                if(k==null) throw new ParseException("null cannot be used as key in object");
                if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot
be used as key in object");
                map.put(k.toString(),v);
                }
        }

Compilation:

javacc JSONHandler.jj
Java Compiler Compiler Version 4.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file JSONHandler.jj . . .
Parser generated successfully.
javac JSONHandler.java

Testing:
Here is the content of test.json

{
organisms:[
        {
        id:10929,
        name:"Bovine Rotavirus"
        },
        {
        id:9606,
        name:"Homo Sapiens"
        }
        ],
proteins:[
        {
        label:"NSP3",
        description:"Rotavirus Non Structural Protein 3",
        organism-id: 10929,
        acc: "ACB38353"
        },
        {
        label:"EIF4G",
        description:"eukaryotic translation initiation factor 4 gamma",
        organism-id: 9606,
        acc:"AAI40897"
        }
        ],
interactions:[
        {
        label:"NSP3 interacts with EIF4G1",
        pubmed-id:[77120248,38201627],
        proteins:["ACB38353","AAI40897"]
        }
        ]
}

This file is parsed, a *java* object is returned and printed on screen.

java JSONHandler < test.json
JAVA OBJECT: {organisms=[{id=10929, name=Bovine Rotavirus}, {id=9606, name=Homo Sapiens}],
proteins=[{description=Rotavirus Non Structural Protein 3, organism-id=10929, label=NSP3,
acc=ACB38353}, {description=eukaryotic translation initiation factor 4 gamma, 
organism-id=9606, label=EIF4G, acc=AAI40897}], interactions=[{pubmed-id=[77120248, 38201627],
label=NSP3 interacts with EIF4G1, proteins=[ACB38353, AAI40897]}]}

Here is the complete javacc source code:

PARSER_BEGIN(JSONHandler)
import java.util.Vector;
import java.util.HashMap;

public class JSONHandler {



  public static void main(String args[])
   {
   try
   {
    JSONHandler parser = new JSONHandler(System.in);
    Object o=parser.object();
    System.err.println("JAVA OBJECT: "+o);
   } catch(Exception err)
   {
   err.printStackTrace();
   }
   }

   /** unescape a C string */
  private static String unescape(String s)
        {
        StringBuilder sb= new StringBuilder(s.length());
        for(int i=1;i< s.length()-1;++i)
                {
                if(s.charAt(i)=='\\')
                        {
                        if(i+1< s.length()-1)
                                {
                                ++i;
                                switch(s.charAt(i))
                                        {
                                        case '\n': sb.append('\n'); break;
                                        case '\r': sb.append('\r'); break;
                                        case '\\': sb.append('\\'); break;
                                        case 'b': sb.append('\b'); break;
                                        case 't': sb.append('\t'); break;
                                        case 'f': sb.append('\f'); break;
                                        case '\'': sb.append('\''); break;
                                        case '\"': sb.append('\"'); break;
                                        default: sb.append(s.charAt(i));
                                        }
                                }
                        }
                else
                        {
                        sb.append(s.charAt(i));
                        }
                }
        return sb.toString();
        }

}

PARSER_END(JSONHandler)

SKIP :
{
  " "
| "\t"
| "\n"
| "\r"
}


TOKEN  : /* LITERALS */
{
  <#LETTER: ["_","a"-"z","A"-"Z"] >
| <#DIGIT: ["0"-"9"] >
| <#SIGN: ["-","+"]>
| <#EXPONENT: ("E"|"e") (<SIGN>)? (<DIGIT>)+ >
| <FLOATING_NUMBER: (<DIGIT>)* "." (<DIGIT>)* (<EXPONENT>)?
        | (<DIGIT>)+ (<EXPONENT>) >
| <INT_NUMBER: (<DIGIT>)+ >
| <IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>|"-")* >
| <#ESCAPE_CHAR: "\\" ["n","t","b","r","f","\\","'","\""] >
| <SIMPLE_QUOTE_LITERAL:
  "\'"
      (   (~["\'","\\","\n","\r"])
        | <ESCAPE_CHAR>
      )*
      "\'"
  >
|
  <DOUBLE_QUOTE_LITERAL:
      "\""
      (   (~["\"","\\","\n","\r"])
       | <ESCAPE_CHAR>
      )*
      "\""
  >
}



public Object object():
        {Object o;}
        {
        (
        o=array()
        | o= map()
        | o= identifier()
        )
        {return o;}
        }

public Object identifier():
        {Token t;}
        {
        (
        t=<FLOATING_NUMBER>
                {
                return new Double(t.image);
                }
        | t=<INT_NUMBER>
                {
                return new Long(t.image);
                }
        | t=<IDENTIFIER>
                {
                if(t.image.equals("true"))
                        {
                        return Boolean.TRUE;
                        }
                else if(t.image.equals("false"))
                        {
                        return Boolean.FALSE;
                        }
                else if(t.image.equals("null"))
                        {
                        return null;
                        }
                return t.image;
                }
        | t=<SIMPLE_QUOTE_LITERAL>
                {
                return unescape(t.image);
                }
        | t=<DOUBLE_QUOTE_LITERAL>
                {
                return unescape(t.image);
                }
        )
        }

public Vector<Object>     array():
        {Vector<Object> vector= new Vector<Object>(); Object o;}
        {
        "[" ( o=object() {vector.addElement(o);} ("," o=object() {vector.addElement(o);} ) * )? "]"
                {
                return vector;
                }
        }

public HashMap<String,Object>     map():
        {HashMap<String,Object> map= new HashMap<String,Object>(); }
        {
        "{" ( keyValue(map) ("," keyValue(map))*)? "}"
                {
                return map;
                }
        }

public  void keyValue( HashMap<String,Object> map):
        {Object k; Object v;}
        {
        (k=identifier() ":" v=object())
                {
                if(k==null) throw new ParseException("null cannot be used as key in object");
                if(!(k instanceof String)) throw new ParseException(k.toString()+"("+k.getClass()+") cannot be used as key in object");
                map.put(k.toString(),v);
                }
        }

That's it. What's next ? jjtree is another component of the javacc package and seems to be a promising tool: it builds a tree structure from the grammar. The nodes of this tree can then be visited just like in a DOM/XML document and a language can be implemented, but here again, the documentation is succinct.

YOKOFAKUN