Over the last little while, I got very obsessive about the whole problem of the Java parser identifying things that are erroneous. This was, to some extent, a consequence of my foray into ANTLR-land. I noticed that the Java grammar here did not (and does not) incorporate any knowledge about what a valid statement in Java is. So, it uncomplainingly parses "statements" such as:
x;
or:
2 = 4;
and so on. Meanwhile, the Java parser that is an integral part of CongoCC does have a fair bit of machinery to detect these things. It does "know" that an expression can only stand up as a statement if it is:
- a method invocation
- an assignment
- the instantiation of an object, i.e. new Foobar(...)
- a pre- or post-increment or decrement, e.g. x++ or --x
Likewise, an assignment can only be valid if the left-hand side is assignable, which is why 2=4 is not valid. The LHS is 2 and cannot be assigned a value.
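To make the two checks concrete, here is a minimal Java sketch. The class names (Expr, Literal, Name, MethodCall, Assignment) and the isValidStatement and isWellFormed methods are hypothetical stand-ins invented for illustration, not the node classes that CongoCC generates, and the toy model covers only a few expression kinds.

```java
// Toy expression nodes. All names here are hypothetical, not CongoCC's generated classes.
abstract class Expr {
    // By default an expression cannot appear on the left of an assignment.
    boolean canBeAssignedTo() { return false; }
    // By default an expression cannot stand alone as a statement.
    boolean isValidStatement() { return false; }
}

final class Literal extends Expr {
    final String text;
    Literal(String text) { this.text = text; }
}

final class Name extends Expr {
    final String id;
    Name(String id) { this.id = id; }
    // In this toy model only a plain name is assignable.
    @Override boolean canBeAssignedTo() { return true; }
}

final class MethodCall extends Expr {
    final String method;
    MethodCall(String method) { this.method = method; }
    @Override boolean isValidStatement() { return true; }
}

final class Assignment extends Expr {
    final Expr lhs, rhs;
    Assignment(Expr lhs, Expr rhs) { this.lhs = lhs; this.rhs = rhs; }
    // An assignment is well formed only when its left-hand side is assignable.
    boolean isWellFormed() { return lhs.canBeAssignedTo(); }
    @Override boolean isValidStatement() { return true; }
}

class StatementCheckDemo {
    public static void main(String[] args) {
        Expr bareName = new Name("x");                                        // x;
        Assignment bad = new Assignment(new Literal("2"), new Literal("4"));  // 2 = 4;
        Assignment good = new Assignment(new Name("x"), new Literal("4"));    // x = 4;
        System.out.println(bareName.isValidStatement()); // false
        System.out.println(bad.isWellFormed());          // false
        System.out.println(good.isWellFormed());         // true
    }
}
```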
A very natural way of dealing with this is to have sanity-check assertions at key points. Thus, in the Java grammar, we have:
{Expression lhs = (Expression) peekNode();}
(AssignmentOperator Expression) #AssignmentExpression(+1)
ASSERT {lhs.canBeAssignedTo(), lhs}
: "The expression " + lhs + " cannot be assigned to."
So, when the assertion fails, which would be the case if we have something like 2=x since 2 cannot be assigned to, we throw an exception and so on.
So, I had an epiphany about this and realized that, in the case of fault-tolerant parsing, we would throw the exception in such a case, and the machinery would then catch it and scan forward to find a resync point where it can get back on the rails. That was my thinking on the question.
Just a few days ago, I realized something. Hold on. If we are in a fault-tolerant mode, why are we throwing the exception in the first place? I mean, we can perfectly well just keep on parsing past invalid statements like that. We record that an error of a certain type occurred at that point, along with its location, but we can certainly just keep parsing.
Then it occurred to me that there may be assertions where we want to just keep on parsing forward, and there could be assertions where we want to throw an exception regardless. However, it seems to me that, in practice, the vast majority of assertions are ones where we could just keep on going if the assertion fails. In fact, our Java grammar currently has 21 ASSERT statements and, AFAICS, every last one of them is something one would want to let through in a fault-tolerant parsing mode.
Still, it is possible that some ASSERT statements do refer to conditions that we want to be fatal even in fault-tolerant mode, while others are things that we can parse past, like the LHS of an assignment not being assignable, i.e. 2 = x sorts of things. We store the information and just keep going. For the latter case, I was thinking about using a new keyword, called FLAG, which would mean that we check the condition and, if the check fails, we flag it and record where it occurred, but we just keep parsing. (I was originally thinking in terms of TOLERATE but then decided that was too long-winded and am now tending towards FLAG.)
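The difference between a fatal ASSERT and a tolerated FLAG could look roughly like the following Java sketch. Everything in it (Problem, AssertionFailure, Checker, and the assertThat and flag methods) is a hypothetical illustration of the idea, not code that CongoCC generates, and FLAG itself is only a proposal at this point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout. This is not CongoCC's generated API.
final class Problem {
    final String location;  // e.g. "Foo.java:7:9"
    final String message;
    Problem(String location, String message) {
        this.location = location;
        this.message = message;
    }
    @Override public String toString() { return location + ": " + message; }
}

final class AssertionFailure extends RuntimeException {
    AssertionFailure(String message) { super(message); }
}

final class Checker {
    private final boolean faultTolerant;
    private final List<Problem> problems = new ArrayList<>();

    Checker(boolean faultTolerant) { this.faultTolerant = faultTolerant; }

    // ASSERT-like: a failed check is always fatal, even in fault-tolerant mode.
    void assertThat(boolean condition, String location, String message) {
        if (!condition) throw new AssertionFailure(location + ": " + message);
    }

    // FLAG-like: in fault-tolerant mode, record the failure and keep parsing.
    // Outside fault-tolerant mode it throws, just like assertThat.
    void flag(boolean condition, String location, String message) {
        if (condition) return;
        if (!faultTolerant) throw new AssertionFailure(location + ": " + message);
        problems.add(new Problem(location, message));
    }

    List<Problem> problems() { return problems; }
}

class FlagDemo {
    public static void main(String[] args) {
        Checker tolerant = new Checker(true);
        // Two failed checks are recorded, and "parsing" simply continues.
        tolerant.flag(false, "Foo.java:3:5", "The expression 2 cannot be assigned to.");
        tolerant.flag(false, "Foo.java:4:5", "Expression x is not a valid statement.");
        for (Problem p : tolerant.problems()) {
            System.out.println(p);
        }
    }
}
```

If it turns out that every assertion should be tolerated in fault-tolerant mode, the two methods in this sketch would collapse into one.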
I am also thinking that it should be a significant goal in the coming months to have a parser for Java that really does fault-tolerant parsing well.
Actually, I am not sure that there is such a strong need for a separate keyword FLAG. It may well be that any assertion should be "tolerated" in fault-tolerant mode. (Or maybe not. I'm not 100% sure.) But anyway, people are welcome to give some opinions about this.
Oh, here is another general improvement that has been made.
The full syntax for an ASSERT is now:
ASSERT {condition, location : message}
(This is for assertions that are expressed in Java code. For the moment, the ones expressed as an expansion are unchanged.) The two arguments location and message are optional. The condition is code in Java (or in the target language, which could also be Python or C#). If the next token is a comma, it is followed by a reference to a Node (or a Token object, but a token is a node!) that is the location to report as being where the problem is. Then there is the optional : followed by the error message.
The point of the extra location
option is that previously, it just used the next token as the location of the error, which is sometimes correct, but only sometimes. (Though it probably is a reasonable default assumption.) But you may have noticed that the error locations in messages are very often one (or a few) tokens off from where the error really happened. For example, if you write:
(x + y);
the last consumed token is the ) and the next token is the semicolon, so it is liable to tell you that you have a problem where the semicolon was, which is maybe not so horrible, since most people would go there and see what the problem is. (In practice, getting the location approximately right is going to be good enough quite often, but...)
But now, it gives the problematic location as being the start of the (x+y) expression. So we have:
Assertion at: Java.ccc:1138:7 failed. Expression (x+y) at ../../Foo.java:7:9 is not a valid statement.
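The effect of the optional location argument can be illustrated with a small Java sketch. The Tok class and the failureLocation helper are hypothetical names made up for this example, and the column numbers simply assume that (x + y); starts at column 9 of line 7, as in the message above.

```java
// Hypothetical token type and helper. The names are illustrative, not CongoCC's API.
final class Tok {
    final String image;
    final int line, column;
    Tok(String image, int line, int column) {
        this.image = image;
        this.line = line;
        this.column = column;
    }
    String position(String file) { return file + ":" + line + ":" + column; }
}

final class LocationDemo {
    // Mirrors the optional location argument: when no explicit location
    // is given, fall back to the next token, as was done previously.
    static String failureLocation(Tok explicitLocation, Tok nextToken, String file) {
        Tok where = (explicitLocation != null) ? explicitLocation : nextToken;
        return where.position(file);
    }

    public static void main(String[] args) {
        // First and last tokens of "(x + y);" on line 7, starting at column 9.
        Tok openParen = new Tok("(", 7, 9);
        Tok semicolon = new Tok(";", 7, 16);
        // Old behavior: the error is reported at the next token, the semicolon.
        System.out.println(failureLocation(null, semicolon, "Foo.java"));      // Foo.java:7:16
        // New behavior: the error is reported at the start of the expression.
        System.out.println(failureLocation(openParen, semicolon, "Foo.java")); // Foo.java:7:9
    }
}
```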
Of course, all of this machinery to provide maximally accurate error messages is also available to you (by "you" I mean people writing grammars), and the Java grammar (and eventually the other ones, like those for C# and Python and Lua) should be a good example of how to do this.