Hard-core JavaCC aficionados -- well, I mean those who started using the tool back in the early days and have possibly only transitioned to JavaCC 21 fairly recently -- are probably aware that LOOKAHEAD
has always been broken in various ways. Simple usage tended to work well enough, but any attempts to do anything beyond a certain threshold of sophistication using LOOKAHEAD
were liable to lead to one running full speed against a wall. I suppose that this was generally understood among "power users" but it is hard to tell, since legacy JavaCC users are, by and large, an amazingly long-suffering group of people. They never seem to have gotten very vocal about complaining about these various things -- perhaps because it was always well understood that there was little point; nothing was ever going to be done to address these problems anyway.
Actually, in terms of lookahead, I'm pretty sure that I fixed the single biggest problem with this a couple of years ago. See here. Nonetheless, there is one pretty major glitch remaining. I've been aware of this for some time and it has been kind of like a thorn stuck in my mind, but am only now turning my attention to the problem. The problem is that the scanahead logic generates code that is not quite correct when it comes to scanning ahead in a choice construct.
Consider the following grammar:
TREE_BUILDING_ENABLED=false;
BASE_NAME="Test";
INJECT TestParser :
{
public static void main(String args[]) {
new TestParser(System.in).Root();
}
}
SKIP : " " ;
NestedChoice :
SCAN 2 "foo" "bar" "baz"
|
"foo"
;
Root :
SCAN NestedChoice => {System.out.println("lookahead succeeded");} NestedChoice
|
{System.out.println("lookahead failed");}
;
You can build the above little example like so:
java -jar javacc-full.jar <grammar filename>
javac TestParser.java
And then it can be tested on the command-line with:
java TestParser
And you input a line to pass into the parser.
java TestParser
foo bar baz
Now, the above example is designed to illustrate a certain point. Consider the NestedChoice
production. It contains a binary choice: it will consume the three tokens "foo", "bar" and then "baz" (that is the first choice) OR it will consume a single "foo" token, the second choice.
However, there is a subtle point here. If the next two tokens off the stream are "foo" and then "bar", it does not check for the "baz". Thus, if the next token is not a "baz", it will not go to the next choice and consume a single "foo". This is because, once we have matched the predicate, which is the first two tokens, we have decided (correctly or not!) that we must go into the first choice. Another way of framing the issue is to realize that the next token, "baz", is not a choice point. At this point, having passed the first 2-token lookahead that was specified, we are committed to this branch and so, if the next token is not a "baz", we are going to hit an error.
So, here is the summary of what input the NestedChoice
production can match:
- The three tokens "foo", "bar", then "baz"
- The single token "foo" as long as it is not followed by a "bar"
Examples of matching input:
- "foo" "bar" "baz"
- "foo" (something other than "bar")
Examples of non-matching input:
- Anything that does not start with "foo"
- "foo" "bar" followed by something other than "baz"
Note that if the next tokens off the stream are "foo", "bar", "foo", it does not matter that the second option, the lone "foo", would have worked. In the first option, we scan ahead two tokens, so if the first two tokens match, we are committed to that first option. That is how the tool's logic works. If we wanted to check the first three tokens before committing ourselves, not just the first two, we could have written SCAN 3
or the line could be written "foo" "bar" "baz" =>||
. (As things stand, the line with SCAN 2
could be written equivalently as "foo" "bar" =>|| "baz"
)
Now let us consider the other production in the grammar, Root
. It is also a choice. The first line starts with:
SCAN NestedChoice =>
meaning that we are going to scan ahead to see if a NestedChoice
production can be matched. Here is where the plot thickens. If we run the test harness:
java TestParser
and we feed it the input:
foo bar bar
We get this output:
imac:~/projects/javacc/legacy_test>java TestParser
foo bar foo
lookahead succeeded
Exception in thread "main" ParseException:
Encountered an error at (or somewhere around) input:1:9
Was expecting one of the following:
BAZ
Found string "foo" of type FOO
at TestParser.handleUnexpectedTokenType(TestParser.java:512)
at TestParser.consumeToken(TestParser.java:504)
at TestParser.NestedChoice(TestParser.java:203)
at TestParser.Root(Test.javacc:20)
at TestParser.Root(TestParser.java:228)
The problem is that, by all rights, the lookahead SCAN NestedChoice
should have failed! But it succeeded because the choice logic when scanning ahead is not the same as when parsing. When you are parsing, the "foo" bar" sequence must be followed by a "baz" for the production to successfully parse. This is because we are committed to the first choice in NestedChoice
once the first two tokens match. If the next token is not what is expected, we do not go on to the next branch, which is matching a single "foo" token. We just hit an error condition.
But when we are scanning ahead, we do go to the next option and see if it matches. This would be the right thing to do if the predicate had failed. So if the input is "foo" "baz", then the predicate for the first choice did not match, so we go to the next choice, which is correct, because, since the predicate (matching the first two tokens) failed, we never got committed to the first branch.
Another way of looking at all of this is that there are actually two ways that a sub-expansion in a choice construct can fail.
- The predicate (if none is specified, this amounts to just checking the first token) fails.
- The predicate succeeds but then there is a failure elsewhere, at a non-choice-point.
So here are the two key cases:
If the predicate of the sub-expansion in a choice fails, then we move on to check the next sub-expansion in the choice construct.
If the scanahead of the sub-expansion in a choice fails BUT the predicate succeeded, then we should NOT go to the next choice. The scan of the choice construct as a whole is considered to have failed.
In other words:
If the nested lookahead succeeded in a choice construct, there should NOT be an attempt to match any subsequent sub-expansions in the choice construct.
This is not currently dealt with correctly. When we are scanning ahead (as opposed to parsing) we go to the next sub-choice regardless of whether the predicate succeeded or not! This is really a major glitch that dates back to legacy JavaCC. It has always been there. Of course, this sort of problem is far worse in legacy JavaCC because nested syntactic lookahead is simply ignored completely in legacy JavaCC. JavaCC 21 does address this, but we are left with this problem that the logic is still not quite right.
Again, this is clearly a bug, because the principle of least surprise really requires that if we write:
SCAN SomeProduction => SomeProduction
or in the legacy syntax:
Lookahead(SomeProduction()) SomeProduction()
or more economically in the new syntax:
SomeProduction =>||
we should expect the semantics of the scanahead to be the same as in the parsing. If the lookahead succeeded, we should expect the parsing to succeed. However, in the example I give here, the lookahead can succeed yet the parsing can still fail.
What to Do?
Well, this is a bug and the thing to do (quite uncontroversially, I would say) is to fix it. However, it must be admitted that the bug-fix is not absolutely backward compatible. It is possible that there are grammars that rely on this working the way it works currently -- regardless of it being wrong!
My current tendency is to fix the bug in JavaCC 21, but to put in a setting allowing one to turn on the old (screwy) way of working.
When the CongoCC transition is done, we won't take that setting with us. In general, this is a recurring theme. Certain things will be kept working in JavaCC 21 but won't be around in CongoCC. I do not anticipate any support for legacy JavaCC syntax in CongoCC. Even some things that were added in JavaCC 21 will not be brought over to Congo. (Hey, if you're gonna go on a trek in the jungle, you don't want to carry a bunch of unnecessary junk with you, eh?)
In JavaCC 21, you can write:
SCAN Foo Bar Baz
or:
=> Foo Bar Baz
and that is the same as:
Foo Bar Baz =>||
I think we're only taking the last of the three to Congo. But I anticipate having an automatic converter tool that converts older deprecated syntax to the newer syntax, so you can run that over a legacy grammar and things like:
void Foobar() :
{}
{
"foo" "bar"
}
will just get rewritten as:
Foobar : "foo" "bar" ;
And so on. And the streamlined syntax is all that will be supported.
Well, in closing, looking back, I realize that I haven't done much with this for at least a half year, but I was sort of absorbed with other things. I'm writing this on this discussion forum (as opposed to the similarly quiet blog because I am painfully aware that the thing has been quiet for a long time and I would like to see if I can remedy that state of affairs. So, certainly, for anybody who is lurking, please feel free to say something. Don't worry about saying anything dumb. I say dumb things often enough myself!