Error reporting

opeongo

I am interested in learning more about the error reporting capabilities of CongoCC.

I have a JavaCC-based custom embedded expression language that can evaluate expressions against a data model in my application. The expression language supports the usual math, string, logical and relational operators plus a large set of functions. Expressions can be entered and compiled, and then subsequently evaluated against records in the data model. I would like to improve the error reporting capabilities of this language, possibly by converting it to CongoCC.

There are three points in the expression life cycle where errors might occur:

1. during parsing, when the input does not match the language (for example, the wrong number of arguments);
2. during semantic analysis, when the arguments are checked for appropriate data types (for example, whether the operands of a binary operator agree in type); and
3. at run time, for example when a substring index is out of bounds of the string.

Ideally I would like to handle each of these errors in a similar way:

provide a description of the error (e.g. something like "function cos takes one numeric argument" or "substring index out of bounds (index=10, string length=5)");
provide a context for where the error occurred: print a portion of the expression showing the closest token to where the error occurred.
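One way to treat all three kinds of error alike is to funnel them through a single exception type that carries the phase, the position, and the description. The following is only a minimal sketch under that assumption: ExpressionError, Phase, and header() are hypothetical names made up for illustration and are not part of JavaCC or CongoCC.

```java
/**
 * Hypothetical common error type for parse, semantic and runtime failures.
 * None of these names come from JavaCC or CongoCC.
 */
final class ExpressionError extends RuntimeException {
    /** The point in the expression life cycle where the error was detected. */
    enum Phase {
        PARSE("Parse"), SEMANTIC("Semantic"), RUNTIME("Runtime");

        final String label;

        Phase(String label) {
            this.label = label;
        }
    }

    final Phase phase;
    final int line;    // 1-based line of the offending token
    final int column;  // 1-based column of the offending token

    ExpressionError(Phase phase, int line, int column, String description) {
        super(description);
        this.phase = phase;
        this.line = line;
        this.column = column;
    }

    /** Builds the same one-line header regardless of which phase raised the error. */
    String header() {
        return String.format("%s error at line %d column %d: %s",
                phase.label, line, column, getMessage());
    }
}
```

The semantic and runtime code would throw this directly, while a parse failure would be caught and rethrown as the same type, so the reporting code only has to deal with one shape of error.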

I can trap errors of type 2 and 3 easily in the Java code that does the semantic analysis and runtime testing. Is it possible to put try-catch blocks into the grammar so that parsing failures are also caught and my custom messages are thrown?

My next question is how to print the expression context when dealing with an error situation. Of course, when you catch a parsing error you know the exact spot in the input where the train jumped the tracks. But for semantic and runtime errors, how is it possible to reconstruct the expression, given that you know what node you are in (and possibly what token is related to that node)? I recall Jonathan (@revusky) mentioning that CongoCC can fully reconstruct inputs from any token, but I cannot seem to find this post again.

vsajip

You should just be able to do toString() of any node to get its source. See this simple grammar:

PARSER_PACKAGE=ex;
PARSER_CLASS=Calc;

SKIP : " " | "\t" | "\n" | "\r" ;

TOKEN :
   <PLUS : "+">
   |
   <MINUS : "-">
   |
   <TIMES : "*">
   |
   <DIVIDE : "/">
   |
   <COMMA : ",">
   |
   <LPAREN : "(">
   |
   <RPAREN : ")">
   |
   <NUMBER :  (["0"-"9"])+ ("."(["0"-"9"])+)?>
   |
   <COS : "cos">
   |
   <SIN : "sin">
   |
   <TAN : "tan">
   |
   <ATAN : "atan">
   |
   <ATAN2 : "atan2">
;

AdditiveExpression :
    MultiplicativeExpression
    (
      (<PLUS>|<MINUS>)
      MultiplicativeExpression
    )*
;

MultiplicativeExpression :
    (<NUMBER> | ParentheticalExpression)
    (
       (<TIMES>|<DIVIDE>)
       (<NUMBER> | ParentheticalExpression)
    )*
    | FuncExpression
;

FuncExpression :
    (<COS> | <SIN> | <TAN> | <ATAN> | <ATAN2>) { CURRENT_NODE.kind = peekNode(); }
    <LPAREN>
    [
        AdditiveExpression { CURRENT_NODE.params.add(peekNode()); } ( <COMMA> AdditiveExpression { CURRENT_NODE.params.add(peekNode()); } )*
    ]
    <RPAREN>
;

ParentheticalExpression :
    <LPAREN>
    AdditiveExpression
    <RPAREN>
;

Root : AdditiveExpression <EOF> ;

INJECT PARSER_CLASS :
    import java.util.Scanner;
{
    static public void main(String[] args) throws ParseException {
       while (true) {
         System.out.println("Enter an arithmetic expression:");
         Scanner scanner = new Scanner(System.in);
         String input = scanner.nextLine();

         if (input.equals(""))
            break;


           try {
               PARSER_CLASS parser = new PARSER_CLASS(input);
               parser.Root();
               Node root = parser.rootNode();
               System.out.println(String.format("Dumping the AST of '%s'...", root));
               root.dump();
               System.out.println("The result is: " + root.evaluate());
           }
           catch (ParseException e) {
             System.err.println("Failed to parse expression: " + input + ": " + e);
           }
           catch (UnsupportedOperationException e) {
             System.err.println("Failed to evaluate expression: " + e);
           }
        }
    }
}

INJECT Node :
{
    default double evaluate() {throw new UnsupportedOperationException();}
}

INJECT FuncExpression :
{
    public Node kind;
    public List<Node> params = new ArrayList<>();

    public double evaluate() {
        String s = kind.toString();
        int numParams = (s.equals("cos") || s.equals("sin") || s.equals("tan") || s.equals("atan")) ? 1 : 2;
        int n = params.size();

        if (numParams != n) {
            throw new UnsupportedOperationException(String.format("calling %s with %d parameters is not supported", s, n));
        }
        if (s.equals("cos")) {
            return Math.cos(params.get(0).evaluate());
        }
        if (s.equals("tan")) {
            return Math.tan(params.get(0).evaluate());
        }
        if (s.equals("atan")) {
            return Math.atan(params.get(0).evaluate());
        }
        if (s.equals("atan2")) {
            return Math.atan2(params.get(0).evaluate(), params.get(1).evaluate());
        }
        if (s.equals("sin")) {
            return Math.sin(params.get(0).evaluate());
        }
        throw new UnsupportedOperationException(String.format("calling %s is not supported", s));
    }
}

INJECT NUMBER :
{
    public double evaluate() {
        return Double.parseDouble(toString());
    }
}

INJECT AdditiveExpression :
{
    public double evaluate() {
        double result = get(0).evaluate();
        for (int i=1; i< size(); i+=2) {
            boolean subtract = get(i) instanceof MINUS;
            double nextOperand = get(i+1).evaluate();
            if (subtract) result -= nextOperand;
            else result += nextOperand;
        }
        return result;
    }
}

INJECT MultiplicativeExpression :
{
    public double evaluate() {
        double result = get(0).evaluate();
        for (int i=1; i< size(); i+=2) {
            boolean divide = get(i) instanceof DIVIDE;
            double nextOperand = get(i+1).evaluate();
            if (divide) result /= nextOperand;
            else result *= nextOperand;
        }
        return result;
    }
}

INJECT ParentheticalExpression :
{
    public double evaluate() {
        return get(1).evaluate();
    }
}

INJECT Root :
{
    public double evaluate() {
        return get(0).evaluate();
    }
}

If you generate the parser using java -jar congocc.jar Arithmetic.ccc, compile it with javac ex/Calc.java, and then run java ex.Calc, you can play around with what happens when you type in correct or wrong data.

opeongo

Thank you for the suggestion. I was looking for a bit more than just the toString() of the token. Ideally I would like to be able to print the context of where the error occurred within the token stream.

Here is an example using the above grammar. Let's say I input an incorrectly formed expression like:
7 * 25 + 15 * * 6

A parse error is expected at the third occurrence of the '*' token. To show this parse error in context I would like to be able to print a message like:

Unexpected input at line 1 column 15:  expecting an operand, found "*"
7 * 25 + 15 * * 6
              ^

This message shows the line in error, with an indicator of the column position where the error occurred. I would find this type of error message very helpful to quickly pinpoint the source of the problem. This is a simple example, but when the expressions are longer, possibly spanning multiple lines, it can be a challenge to figure out where column 86 is in the input string.

I can format the message, but what I would need to do is be able to partially reconstruct the input expression. I recall Jonathan explaining in a post a few years ago how to do this. My memory is vague, but I believe that token objects had prev and next token pointers, and even white space was retained in the token stream. So if this is the case then I could create this "context" by starting at the token having the problem, prepend several previous tokens on the front, append several following tokens on the end.

So this is what I am asking about. Is it possible to chain backwards and forwards through the list of tokens in order to reconstruct a portion of the input expression? And is this information available both at parse time and later at evaluation time, when the AST is evaluated?

revusky

opeongo Ideally I would like to be able to print the context of where the error occurred within the token stream.

Hi Tom,

Any Node object in CongoCC, whether it is a nonterminal or a terminal (i.e. a Token), has a fairly significant API by default. Consider this snippet from the Node.java.ftl template, which is the template used to generate the base Node interface: https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/Node.java.ftl#L436-L443

That getLocation() is just a convenience method that is generated by default. You could override it if you want different text. Or you could INJECT your own specialized method, like:

 INJECT Node : 
 {
      public default String getCustomizedLocation() {....}
 }

Any Node object also has nextSibling() and previousSibling() methods. See: https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/Node.java.ftl#L253-L268

So, I guess what you really need to do is get familiar with the API that is generated by default and see what you can do with it. If it doesn't do exactly what you want, you can override the implementations in BaseNode.java and Token.java.

So this is what I am asking about. Is it possible to chain backwards and forwards through the list of tokens in order to reconstruct a portion of the input expression?

Well, any node/token can tell you its starting and ending offset. So, for example, if you want the text starting with node a and ending with node b, this should work:

 String srcAToB = myLexer.subSequence(a.getBeginOffset(), b.getEndOffset()).toString();
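
The offset-based slice can be tried out without a generated parser. This is a minimal sketch, assuming the begin offset is inclusive and the end offset is exclusive (check the generated code for the actual convention). SourceSlice and between() are hypothetical names, and the two offsets are passed in as plain ints rather than read from real nodes.

```java
/** Hypothetical helper: cuts a piece of source text out by character offsets. */
final class SourceSlice {
    /**
     * Returns the text from beginOffset (inclusive) to endOffset (exclusive),
     * clamped to the bounds of the source so a bad offset cannot throw.
     */
    static String between(CharSequence source, int beginOffset, int endOffset) {
        int begin = Math.max(0, beginOffset);
        int end = Math.min(source.length(), endOffset);
        // An empty or inverted range yields an empty string.
        return begin < end ? source.subSequence(begin, end).toString() : "";
    }
}
```

Because the slice is taken from the original character sequence, any whitespace between the two nodes comes along unchanged.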

But I guess you really need to just start using CongoCC and then you can eyeball the generated code (or javadocs) and see what API is automatically generated and see what you can do with it.

In terms of handling parsing errors, you would need to eyeball what API is automatically generated for ParseException.java. Here is the template used for generating the ParseException.java file: https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/ParseException.java.ftl

In particular, the getToken() method should return the problematic token that the parser tripped up on. Of course, in an error message, you might need to use a preceding token/location to give a more understandable error message. But that's all kind of heuristics, as it were...

And is this information available both at parse time and later at evaluation time when the ast is evaluated?

No information is thrown away. Any nonterminal Node or Token "knows" its location and so on and there is API to get the things in the context, like the node or token immediately preceding or following and so on.

I don't know whether I should say this exactly, but I find it kind of perplexing that you are still using the old legacy JavaCC. I mean, the fact is that transitioning to CongoCC has whatever fixed cost whether you do it earlier or later and you might as well do it earlier since the cost is the same and you get the benefits immediately. Well, that is not about you specifically, Tom. It so happens that it is very hard to get people to switch. Even in situations where I've volunteered to do most of the work for them, like with that ancient Beanshell project. It's not just about how inferior and limited the legacy JavaCC is, but also, if you ask the ostensible maintainers any question about how the thing works, you'll never get an answer really. It's really just completely unsupported abandonware. Of course, they want to pretend that that is not the situation, but I'm sure you know the score...

Out of curiosity, how many lines of legacy JavaCC grammar code do you have?

vsajip

opeongo This message shows the line in error, with an indicator of the column position where the error occurred. I would find this type of error message very helpful to quickly pinpoint the source of the problem. This is a simple example, but when the expressions are longer, possibly spanning multiple lines, it can be a challenge to figure out where column 86 is in the input string.

It can be fiddly, but not really that difficult, as the tokens have full line and column information, and you can scan the source to find the exact line where the error occurred, then print that line and the caret. I changed my simple example's try/catch handling to:

         try {
               PARSER_CLASS parser = new PARSER_CLASS(input);
               parser.Root();
               Node root = parser.rootNode();
               System.out.println(String.format("Dumping the AST of '%s'...", root));
               root.dump();
               System.out.println("The result is: " + root.evaluate());
         }
         catch (ParseException e) {
              // e.getToken() is the token the parser tripped up on
              Node n = e.getToken();
              int errCol = n.getBeginColumn();
              int startCol = 1;
              System.err.println("Failed to parse expression: " + input + ": " + e);
              System.err.println(input);
              int spaces = errCol - startCol + 1;
              String fmt = "%" + spaces + "s";
              System.err.println(String.format(fmt, "^"));
         }
         catch (UnsupportedOperationException e) {
           System.err.println("Failed to evaluate expression: " + e);
         }

and I seem to get reasonable error reporting, though you can of course tweak the wording, formatting etc. to suit. What I get with an erroneous input:

Enter an arithmetic expression:
7 * 25    + 15 *     * 6
Failed to parse expression: 7 * 25    + 15 *     * 6: ex.ParseException: 
Encountered an error at (or somewhere around) input:1:22
Was expecting one of the following:
LPAREN, NUMBER
Found string "*" of type TIMES
7 * 25    + 15 *     * 6
                     ^
Enter an arithmetic expression:

Obviously the detail of what's expected is grammar-dependent, but it's clear to see that a CongoCC-generated parser gives the expected token types, the encountered erroneous token and you can draw a caret at the error position using the information available. With multiple lines, the problem isn't much harder as you just have to locate the error line (available from e.getToken().getBeginLine()) and print the source line before the caret line is printed. The source is always available and you can scan backwards and forwards from e.getToken().getBeginOffset() for \n to find the source line where the error occurred.
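
The multi-line case can be sketched in plain Java. The example below assumes only that a 0-based character offset of the error token is known and that 0 <= offset <= source.length(). CaretReport and render() are hypothetical names, not part of the generated API. It scans backwards and forwards from the offset for \n, then pads the caret line, copying tabs so the caret still lines up when the source line contains them.

```java
/** Hypothetical helper: shows the source line containing an offset, with a caret under it. */
final class CaretReport {
    /**
     * @param source the complete input text
     * @param offset 0-based offset of the offending token, 0 <= offset <= source.length()
     * @return the offending line, a newline, and a caret line pointing at the offset
     */
    static String render(String source, int offset) {
        // Scan backwards for the previous newline; -1 + 1 = 0 means "first line".
        int lineStart = source.lastIndexOf('\n', offset - 1) + 1;
        // Scan forwards for the next newline; none means "last line".
        int lineEnd = source.indexOf('\n', offset);
        if (lineEnd < 0) {
            lineEnd = source.length();
        }
        String line = source.substring(lineStart, lineEnd);
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1); // tolerate Windows line endings
        }
        // Copy tabs into the padding so the caret aligns under tab-indented text.
        StringBuilder pad = new StringBuilder();
        for (int i = lineStart; i < offset; i++) {
            pad.append(source.charAt(i) == '\t' ? '\t' : ' ');
        }
        return line + "\n" + pad + "^";
    }
}
```

With a real parser, the offset argument would come from the token held by the exception, and the returned text would be printed after the error description.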

opeongo

revusky Yes, I know the score with legacy JavaCC. It is not going to change. However, my grammars run fine, so it is currently the least-cost solution for me.

Finding the time to convert to CongoCC, just to convert to CongoCC, is a challenge. But I would like to improve the error message capability of my expression engine, so that is the reason I think it might be time to make the switch. Pay the cost and get better error messages.

I have 4 legacy grammars. The most complex is 1500 lines in the jjt file, resulting in about 200 node classes. The hand coded node files have about 35k lines of code (including comments/whitespace). This is the one that I am experimenting with in congocc.

opeongo

vsajip Thank you for this. I see that you are using the input string, rather than getting the context from the tokens. I will have to reflect on this and how it will work with the kind of error messages that I want to generate, and in the different contexts that I wanted to generate them from.

I think that I will do as Jonathan suggests and I will play around with the API and see what I can do. When I have something useful I will show it here.

vsajip

opeongo Sure. I used the input as a convenience, but as I explained in my last post, it's easy enough to generalise it. The tokens themselves will each contain their start and end offsets, and start and end line and column information, so you could of course reconstruct the input from them, but you would have to account for whitespace between the tokens. In my view, it's easier to just fetch the source line corresponding to the error token and go from there. It's been so long since I used legacy JavaCC that I cannot remember its API, but the CongoCC API is pretty capable of being used for good error reporting.
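
A token-based context window does not have to re-insert whitespace by hand if it only uses the tokens to pick the two end points and then slices the original source between them. This is a minimal sketch of that idea. Tok is a hypothetical stand-in for a generated token that keeps only a begin offset (inclusive) and an end offset (exclusive), and TokenWindow and around() are likewise made-up names.

```java
import java.util.List;

/** Hypothetical helper: source text covering a few tokens on either side of one token. */
final class TokenWindow {
    /** Stand-in for a generated token; only the offsets matter here. */
    static final class Tok {
        final int begin; // inclusive
        final int end;   // exclusive

        Tok(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Returns the source text spanning from `radius` tokens before tokens[index]
     * to `radius` tokens after it, including whatever whitespace lies between them.
     */
    static String around(CharSequence source, List<Tok> tokens, int index, int radius) {
        int first = Math.max(0, index - radius);
        int last = Math.min(tokens.size() - 1, index + radius);
        return source.subSequence(tokens.get(first).begin, tokens.get(last).end).toString();
    }
}
```

For the input 7 * 25 + 15 * * 6, a radius of 2 around the second of the two adjacent * tokens yields the text 15 * * 6, with the original spacing intact.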

vsajip

I've made a slight change to the Token API to facilitate getting the source line. So with the latest congocc.jar, the exception clause in the parsing loop above could look like this:

                Token token = (Token) e.getToken();
                int errCol = token.getBeginColumn();
                int startCol = 1;
                System.err.println("Failed to parse expression: " + input + ": " + e);
                System.err.println(token.getSourceLine());
                int spaces = errCol - startCol + 1;
                String fmt = "%" + spaces + "s";
                System.err.println(String.format(fmt, "^"));

The token.getSourceLine() gets the line of source which contains the token.

adMartem

Just to throw another chicken in the pot, I use ATTEMPT ... RECOVER extensively in my grammar to catch the error, issue an appropriate message and skip over what I think is the problem. Unlike JavaCC's try/catch, this leaves the parser in a known state prior to performing the RECOVER code and/or expansion. This might be something to look at as well.