adMartem
Well, it seems to me that the way it is behaving is correct -- at least based on the "spec" I set out above. In:
A : C ASSERT ~("d") | CD =>|| ;
As I see it, the most fundamental rule of JavaCC (21 anyway) is that in the absence of any indication to the contrary, we're just scanning ahead one token to resolve the decision at a choice point. So the first choice in the production A above is just going to scan the "c" and then jump out. So it won't hit the assertion. If you wrote:
A : C ASSERT ~("d") =>|| | CD =>|| ;
then it would scan to the end and would check the assertion and when that fails, go into the next choice CD. OTOH, if you wrote:
A: ASSERT {getToken(2).getType() != TokenType.D}# C || CD ;
then I guess that would work. It's always going to hit the assertion at that point because we haven't scanned ahead a single token yet. Well, I think the above would work, but I'm not recommending it. It's terribly ugly.
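To make the difference between the first two forms concrete, here is a toy model in plain Java. It is only a sketch of the decision logic as I describe it above, not what the tool actually generates. The names `ChoiceDemo` and `choose` are made up for illustration, and it assumes that CD matches "c" followed by "d" and that the assertion simply requires the next token not to be "d".

```java
import java.util.List;

// Toy model of the choice point in production A (hypothetical names throughout).
// Choice 1 models: C ASSERT ~("d")   Choice 2 models: CD, assumed to be "c" "d".
final class ChoiceDemo {

    // Lookahead for choice 1. With one-token lookahead we stop after "c";
    // when scanning to the end of the expansion we also evaluate the assertion.
    static boolean firstChoiceMatches(List<String> tokens, boolean scanToEnd) {
        if (tokens.isEmpty() || !tokens.get(0).equals("c")) {
            return false;
        }
        if (!scanToEnd) {
            return true; // jump out before the assertion is ever reached
        }
        // The assertion: the token after "c" must not be "d".
        return tokens.size() < 2 || !tokens.get(1).equals("d");
    }

    // Lookahead for choice 2, which always scans to the end in both grammars.
    static boolean secondChoiceMatches(List<String> tokens) {
        return tokens.size() >= 2
                && tokens.get(0).equals("c")
                && tokens.get(1).equals("d");
    }

    // Choices are tried in order and the first successful lookahead wins.
    // Returns 1 or 2 for the selected choice, or 0 if neither matches.
    static int choose(List<String> tokens, boolean firstScansToEnd) {
        if (firstChoiceMatches(tokens, firstScansToEnd)) {
            return 1;
        }
        if (secondChoiceMatches(tokens)) {
            return 2;
        }
        return 0;
    }
}
```

In this model, `ChoiceDemo.choose(List.of("c", "d"), false)` selects choice 1 (and the assertion would only fail later, during the actual parse), while `ChoiceDemo.choose(List.of("c", "d"), true)` rejects choice 1 during lookahead and falls through to choice 2.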
Now, I would grant one point about this. In the first expansion above, where we're only scanning one token ahead, the second choice CD is unreachable. And it's not too hard to check for something like that at build-time and emit a warning. In fact, the legacy tool would warn about this. (Though the warnings it emitted tended to be hard to understand, and actually, IMHO, not even correct...) At some point, I tore out all that code, in large part because it was just written so horribly, and I figured I'd put some of the warnings back in later (but I never did). In general, as long as it's single-token lookahead, it's not too hard to warn about such things. Of course, as I point out here, regular programming languages do not warn in analogous situations. If you have:
if (x>0) {
...
}
else if (x>1) {
...
}
the code in the else-if block is pretty obviously unreachable and nobody considers this a big deal. (Of course, that might be because of the theoretical possibility that some other thread could change x in the nanosecond or so between the first check and the second one. That's obviously... well.... but it is a theoretical possibility, so....)
Still, in terms of this pure FIRST SET sort of stuff, I think there would be some value to warning that a choice is unreachable. Sure, why not? But one wrinkle in all this is that in the whole "context-free grammar" sort of framing, this problem is presented as an "ambiguity", and I honestly don't agree with that framing. IMHO, it is plainly obvious that a tool like this goes through the choices sequentially and takes the first one that matches. That a later listed choice also matches does not, in my mental model, constitute an "ambiguity". It's just how the tool works. So it's not an "ambiguity", strictly speaking, but quite possibly worth warning about, because it probably is not going to do what the coder really intended...
And funnily enough, the lexer side is also based on the idea that the first match wins. The string "for" matches the pattern for an identifier but also matches the pattern for the keyword "for". We match the keyword because it is stated earlier and the earlier match wins. Again, nobody thinks that is a big deal. Of course, the string "forget" matches identifier, not the keyword "for" followed by the identifier "get", because we match as much input as we can, a.k.a. greedy matching. And I kinda think everybody beyond neophytes understands all this. (I mean without even studying the question.... it just works about the way one would expect!)
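The two lexer rules just described (longest match first, and the earlier rule wins on a tie) can be sketched in a few lines of Java. This is a minimal illustration, not the lexer the tool generates. The class `TinyLexer`, the token names, and the two patterns are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of "greedy match, earlier rule wins on ties" (hypothetical names).
final class TinyLexer {
    // Rules in declaration order: the keyword is stated before the identifier.
    static final String[] NAMES = {"FOR", "IDENTIFIER"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("for"),
        Pattern.compile("[a-z]+")
    };

    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) == ' ') { // skip spaces between tokens
                pos++;
                continue;
            }
            int bestLength = 0;
            String bestName = null;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
                // A strictly longer match replaces the current best, so on a
                // tie the rule that was declared earlier is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestLength = m.end() - pos;
                    bestName = NAMES[i];
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No token matches at position " + pos);
            }
            result.add(bestName + ":" + input.substring(pos, pos + bestLength));
            pos += bestLength;
        }
        return result;
    }
}
```

With these two rules, `TinyLexer.tokenize("for")` yields the single token `FOR:for`, and `TinyLexer.tokenize("forget")` yields the single token `IDENTIFIER:forget`.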
Well, another point about all this is that even if we had the dead code check, writing the C expansion this way:
C : SCAN 1 {true}# => "c";
would probably cause the check to be short-circuited. You see, once you have the semantic predicate, which is Java code, the logic would just assume that it possibly returns false. It's not even going to do the minimal analysis to see that the predicate always returns true because it is the literal "true"! And once you assume that the next token can be "c" but the lookahead can still fail because the Java condition returns false, then we can't be sure that the second choice is dead code. (Paradoxically, a human just eyeballing it can see that pretty easily, but we never scan even minimally into any code that is expressed in Java. It's a black box in terms of our analysis.)
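For what it's worth, here is roughly what such a single-token dead-choice check could look like, including the way a semantic predicate defeats it. This is a hypothetical sketch and not code from the tool: the class `DeadChoiceCheck`, the `Choice` type, and the idea of handing it precomputed FIRST sets are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a build-time check for unreachable choices
// under single-token lookahead. FIRST sets are assumed to be precomputed.
final class DeadChoiceCheck {

    static final class Choice {
        final String name;
        final Set<String> firstSet;          // tokens that can start this choice
        final boolean hasSemanticPredicate;  // Java code we treat as a black box

        Choice(String name, Set<String> firstSet, boolean hasSemanticPredicate) {
            this.name = name;
            this.firstSet = firstSet;
            this.hasSemanticPredicate = hasSemanticPredicate;
        }
    }

    // Returns the names of choices that can never be selected.
    static List<String> unreachable(List<Choice> choices) {
        Set<String> alwaysTaken = new HashSet<>(); // tokens an earlier choice is sure to claim
        List<String> dead = new ArrayList<>();
        for (Choice c : choices) {
            if (!c.firstSet.isEmpty() && alwaysTaken.containsAll(c.firstSet)) {
                dead.add(c.name);
            }
            // A choice guarded by a semantic predicate might be skipped at
            // runtime, so it does not shadow the choices that come after it.
            if (!c.hasSemanticPredicate) {
                alwaysTaken.addAll(c.firstSet);
            }
        }
        return dead;
    }
}
```

If both choices start with "c" and neither has a predicate, the second one is reported as dead. Mark the first choice as having a semantic predicate, even one that is literally `true`, and nothing is reported.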
But anyway, I think the code snippet you provide is behaving correctly.