Yeah, well, actually, JavaCC 21 (and therefore CongoCC) does not support setting a global lookahead other than 1. At some point (and I can't remember exactly when) I just removed that option. So, the default lookahead amount is now always 1.
In general, though, I think that numerical lookahead is actually a very screwy, error-prone sort of idea to start with. (And that's leaving aside a globally set lookahead amount other than 1, which seems like an extra bad idea generally.) I mean, if you have:
LOOKAHEAD(2) "foo" "bar" "baz"
the alternative using up-to-here notation would be:
"foo" "bar" =>|| "baz"
And it's not just that this achieves a much greater economy of expression, but I think, more importantly: this actually corresponds much better to how a human thinks about the problem.
I really think so. A human trying to read code thinks to himself: Well, I scan up to this point and then I can stop because I know that this must be a method declaration. A human does not think: "I need to scan exactly 7 tokens ahead". And besides that, that approach does not even work in the general case! I mean, there's no specific number of tokens you can scan ahead to identify a class declaration, say, and I think that's usually the case with the main constructs in any fairly complex language. So, in the Java grammar, you have:
ClassDeclaration :
    {permissibleModifiers = EnumSet.of(TokenType.PUBLIC, TokenType.PROTECTED, TokenType.PRIVATE,
                                       TokenType.ABSTRACT, TokenType.FINAL, TokenType.STATIC, TokenType.STRICTFP, TokenType.SEALED,
                                       TokenType.NON_SEALED);}#
    Modifiers
    "class" =>||
    TypeIdentifier
    [ TypeParameters ]
    [ ExtendsList ]
    [ ImplementsList ]
    [ PermitsList ]
    ClassOrInterfaceBody
;
We scan past whatever modifiers there are (if any) and then, once the next token is "class", well, we don't need to scan any more, this definitely must be a class declaration! That's how it's expressed and that's how a human thinks about the problem. (Well, this human anyway!)
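To make that concrete, here is a rough Java sketch of the difference. To be clear, this is not the code CongoCC generates: the names (ClassDeclScan, scanToClassKeyword, classKeywordWithin) are made up for illustration, tokens are modelled as plain strings, and annotations (which also count as modifiers in real Java) are ignored. The first method stops as soon as it sees "class" after the modifiers, which is roughly what the =>|| marker expresses; the second only gets to look at a fixed number of tokens.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a scan routine; not CongoCC's actual output.
class ClassDeclScan {
    // Simplified modifier set (assumption: tokens are plain strings, no annotations).
    static final Set<String> MODIFIERS = Set.of(
        "public", "protected", "private", "abstract", "final",
        "static", "strictfp", "sealed", "non-sealed");

    // Mimics `Modifiers "class" =>||`: skip however many modifiers there are,
    // then decide on the very next token. Nothing after "class" is inspected.
    static boolean scanToClassKeyword(List<String> tokens) {
        int i = 0;
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            i++;
        }
        return i < tokens.size() && tokens.get(i).equals("class");
    }

    // Mimics a fixed numerical lookahead: only the first k tokens are visible.
    static boolean classKeywordWithin(List<String> tokens, int k) {
        int limit = Math.min(k, tokens.size());
        for (int i = 0; i < limit; i++) {
            String t = tokens.get(i);
            if (t.equals("class")) return true;
            if (!MODIFIERS.contains(t)) return false;
        }
        return false; // ran out of budget before reaching "class"
    }

    public static void main(String[] args) {
        List<String> decl = List.of("public", "static", "final", "class", "Foo", "{", "}");
        System.out.println(scanToClassKeyword(decl));    // true
        System.out.println(classKeywordWithin(decl, 2)); // false
        System.out.println(classKeywordWithin(decl, 4)); // true
    }
}
```

With three modifiers in front, a budget of 2 or 3 tokens never reaches "class", and whatever number you pick, one more modifier can break it. The scan-based check does not care how many modifiers there are.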
And, of course, even if a numerical lookahead can be used, it still tends to be more fragile, so if you have:
LOOKAHEAD (4) Foobar() Baz()
Here the number is 4 because a Foobar production always consumes 3 tokens and you also need to check that the first token of Baz is the right one. But if, due to whatever language evolution, you rewrite the Foobar production so that it is potentially 4 or 5 tokens, say... then all these LOOKAHEAD(n) things have to be rewritten to reflect that. Using up-to-here-plus notation, on the other hand, we can write:
Foobar =>|+1 Baz
And that doesn't need to be changed. It's much more robust than the code that uses a numerical lookahead. And again, I think it also corresponds more closely to how a human being actually thinks of the problem.
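Here is a small Java sketch of why the hard-coded number goes stale. Everything in it is hypothetical: the ScanRoutine interface, the two Foobar versions (three fixed tokens, then three tokens plus up to two optional "x" tokens), and the assumption that Baz starts with a literal "baz" token. The fixedBudget method is only a rough model of a numerical lookahead (it assumes Foobar itself matches in full), not a description of what any generated parser literally does.

```java
import java.util.List;

// Hypothetical model of `Foobar =>|+1 Baz` versus a fixed token budget.
class UpToHerePlusOne {
    // Returns the index just past the match, or -1 if there is no match.
    interface ScanRoutine { int scan(List<String> tokens, int start); }

    // Original Foobar: always exactly "a" "b" "c" (3 tokens).
    static final ScanRoutine FOOBAR_V1 = (t, i) ->
        matches(t, i, "a", "b", "c") ? i + 3 : -1;

    // Evolved Foobar: "a" "b" "c" plus up to two optional "x" tokens (3 to 5 tokens).
    static final ScanRoutine FOOBAR_V2 = (t, i) -> {
        int end = FOOBAR_V1.scan(t, i);
        if (end < 0) return -1;
        for (int n = 0; n < 2 && end < t.size() && t.get(end).equals("x"); n++) {
            end++;
        }
        return end;
    };

    static boolean matches(List<String> t, int i, String... expected) {
        if (i + expected.length > t.size()) return false;
        for (int k = 0; k < expected.length; k++) {
            if (!t.get(i + k).equals(expected[k])) return false;
        }
        return true;
    }

    // Scan all of Foobar, however long it is, then exactly one token of Baz.
    static boolean foobarPlusOne(ScanRoutine foobar, List<String> t) {
        int end = foobar.scan(t, 0);
        return end >= 0 && end < t.size() && t.get(end).equals("baz");
    }

    // Rough model of LOOKAHEAD(k): stop checking once k tokens have been seen.
    static boolean fixedBudget(ScanRoutine foobar, List<String> t, int k) {
        int end = foobar.scan(t, 0);
        if (end < 0) return false;
        if (end >= k) return true; // budget spent inside Foobar, Baz never checked
        return end < t.size() && t.get(end).equals("baz");
    }

    public static void main(String[] args) {
        List<String> bad = List.of("a", "b", "c", "x", "oops");
        System.out.println(foobarPlusOne(FOOBAR_V2, bad));  // false
        System.out.println(fixedBudget(FOOBAR_V2, bad, 4)); // true
    }
}
```

Once Foobar can be 4 tokens long, a budget of 4 is used up before Baz is ever looked at, so the check quietly stops doing its job. The scan-plus-one version adapts by itself because it never mentions a number that depends on Foobar's length.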
So, in the Java grammar here you have:
Initializer# :
[ "static" ] =>|+1 Block
;
This could admittedly be expressed with a syntactic lookahead, but then the lookahead has to be repeated every time you use the Initializer production at a choice point. And it's really kind of ugly:
LOOKAHEAD (["static"] "{") Initializer()
But if you use a numerical lookahead, you would have to write:
LOOKAHEAD(2) Initializer()
But there it is actually looking ahead 2 tokens in many cases when it doesn't need to. It only needs to look ahead 2 tokens if the next token is "static". Otherwise, a single token lookahead is enough.
Actually, I suppose if one were going to look at what I tend to think of as best practices in terms of writing a grammar, the best reference could well be the C# grammar. That, by the way, was quite difficult to write and it really stresses the CongoCC feature set to the limit! I'm pretty sure that the legacy JavaCC is simply not powerful enough to write a decent grammar for that language.