Hi there,
It has been some time since I had to make changes to my parsers, but I needed to add some new features, so I figured I'd try out a newer version of CongoCC than the 2022-05-31 version in the current build.
Some of the interfaces changed slightly, no big deal, but to my surprise, the parser itself also broke!
Was expecting one of the following:
B, C
Found string "bbbbbbb" of type INVALID
Ehm... It can't find a B in the b's? That b odd...
Fortunately, the parsers generated by CongoCC are really easy to understand and debug (that may well be my favourite feature of CCC), and given the rather odd error message, I figured the problem must be related to the lexical state switching. Or maybe I was just holding it wrong. So I dug in, created a minimal reproducer, and found a workaround.
PARSER_PACKAGE=test1;
USE_CHECKED_EXCEPTION=true;
<DEFAULT,LS_A,LS_B> SKIP: " ";
<DEFAULT> TOKEN:
<UNUSED: "unused">
;
<LS_A> TOKEN:
<A: "a" >
;
<LS_B> TOKEN:
<B: "b" >
;
<LS_C> TOKEN:
<C: "c" >
;
Start : LS_A :
FindABC <EOF>
;
FindABC :
(<A>)? (TempB | TempC)
;
FindA :
<A>
;
TempB :
FindB
;
FindB : LS_B :
<B>(<B>)*
;
TempC :
FindC
;
FindC : LS_C :
<C>(<C>)*
;
You can probably spot right away where it goes wrong:
(<A>)? (TempB | TempC)
This production generates:
if (nextTokenType() == A) {
    consumeToken(A);
}
if (nextTokenType() == B) {
    ...
} else if (nextTokenType() == C) {
    ...
}
But this can't work: nextTokenType() can't find a B here, since that would require a lexical state switch. That explains the type INVALID: the current lexical state only knows A, so it can't match the b.
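To make the failure mode concrete, here is a toy model of it in plain Java. It is only a sketch and not CongoCC's actual lexer: the class LexicalStateDemo, its Token type, and the peek method are all hypothetical names, and I'm assuming that unmatched characters get merged into a single INVALID token because that is what the error message above suggests.

```java
import java.util.Map;

public class LexicalStateDemo {
    // Toy assumption: each lexical state recognizes exactly one character,
    // mirroring the reproducer grammar (whitespace skipping is left out).
    static final Map<String, Character> STATES =
            Map.of("LS_A", 'a', "LS_B", 'b', "LS_C", 'c');

    // Hypothetical token type, not the class CongoCC generates.
    static final class Token {
        final String type;
        final String image;

        Token(String type, String image) {
            this.type = type;
            this.image = image;
        }
    }

    // Peek at the next token using only the rules of the given lexical state.
    static Token peek(String input, int pos, String state) {
        if (pos >= input.length()) {
            return new Token("EOF", "");
        }
        char expected = STATES.get(state);
        if (input.charAt(pos) == expected) {
            // "LS_A" -> "A", "LS_B" -> "B", "LS_C" -> "C"
            return new Token(state.substring(3), String.valueOf(expected));
        }
        // Assumption: everything up to the next matchable character
        // (or the end of the input) becomes one INVALID token.
        int end = pos;
        while (end < input.length() && input.charAt(end) != expected) {
            end++;
        }
        return new Token("INVALID", input.substring(pos, end));
    }

    public static void main(String[] args) {
        // Lookahead while still in LS_A, as in the generated code above.
        Token wrong = peek("bbbbbbb", 0, "LS_A");
        System.out.println(wrong.type + " \"" + wrong.image + "\""); // INVALID "bbbbbbb"

        // The same lookahead after switching to LS_B first.
        Token right = peek("bbbbbbb", 0, "LS_B");
        System.out.println(right.type + " \"" + right.image + "\""); // B "b"
    }
}
```

In this model the check for B can only succeed if the lookahead runs in LS_B, which is the step the generated code skips when the state switch is nested one production deeper.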
The workaround is to add a SCAN before the TempB call. CongoCC does this automatically when the lexical state switch is not hidden deeper in the production tree. This used to work automatically even with nesting, and it seems it partially still does, since the error message shows the parser knows it should be looking for a B!
The other odd thing is that while searching for the A, it goes all the way to the end of the input instead of giving up right at the first non-a. That seems somewhat inefficient, though for my use case it's irrelevant, since my inputs are quite small.