Hi there,

It has been some time since I had to make changes to my parsers, but I needed to add some new features, so I figured I'd try out a newer version of CongoCC than the 2022-05-31 version in the current build.
Some of the interfaces changed slightly, no big deal, but to my surprise, the parser itself also broke!

Was expecting one of the following:
B, C
Found string "bbbbbbb" of type INVALID

Ehm... It can't find a B in the b's? That b odd...

Fortunately, the parsers generated by CongoCC are really easy to understand and debug (that may well be my favourite feature of CCC), and given the rather odd error message, I figured the problem must be related to the lexical state switching. Or maybe I was just holding it wrong. So I dug in, created a minimal reproducer, and found a workaround.

PARSER_PACKAGE=test1;
USE_CHECKED_EXCEPTION=true;

<DEFAULT,LS_A,LS_B> SKIP: " ";

<DEFAULT> TOKEN:
    <UNUSED: "unused">
;
<LS_A> TOKEN:
    <A: "a" >
;
<LS_B> TOKEN:
    <B: "b" >
;
<LS_C> TOKEN:
    <C: "c" >
;

Start : LS_A :
    FindABC <EOF>
;
FindABC :
    (<A>)? (TempB | TempC)
;
FindA :
    <A>
;
TempB :
  FindB
;
FindB : LS_B :
    <B>(<B>)*
;

TempC :
  FindC
;
FindC : LS_C :
    <C>(<C>)*
;

You can probably directly spot where it goes wrong:

     (<A>)? (TempB | TempC)

This production generates:

            if (nextTokenType() == A) {
                consumeToken(A);
            }
            if (nextTokenType() == B) {
                ...
            } else if (nextTokenType() == C) {
                ...
            }

But this can't work: nextTokenType() can't find a B here, because that would require a lexical state switch. That explains the type INVALID: the current lexical state only knows A, so it can't match the b.
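To make the mechanism concrete, here is a tiny self-contained model of the failure mode (all names and data structures here are invented for illustration; this is not CongoCC's actual generated code): peeking a token type is state-dependent, so a single-token lookahead in the wrong lexical state can only ever see INVALID.

```java
import java.util.Map;

// Minimal model of the bug: token definitions depend on the active
// lexical state, so peeking the next token without first switching
// state cannot match anything from the other state.
public class LexicalStateDemo {
    // Which single-character tokens each lexical state can produce.
    static final Map<String, Map<Character, String>> TOKEN_DEFS = Map.of(
        "LS_A", Map.of('a', "A"),
        "LS_B", Map.of('b', "B")
    );

    // Peek the next token type in the given lexical state.
    static String nextTokenType(String text, int pos, String state) {
        return TOKEN_DEFS.get(state).getOrDefault(text.charAt(pos), "INVALID");
    }

    public static void main(String[] args) {
        String text = "bbbbbbb";
        // Single-token lookahead in the current state LS_A: the 'b'
        // cannot be matched there, so we get INVALID -- the reported error.
        System.out.println(nextTokenType(text, 0, "LS_A")); // INVALID
        // A predicate-style lookahead that first switches to LS_B (as
        // entering FindB would) sees the B token just fine.
        System.out.println(nextTokenType(text, 0, "LS_B")); // B
    }
}
```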
The workaround is adding a SCAN before the TempB call, which happens automatically if the lexical state switch is not hidden deeper in the production tree. This used to work automatically even with nesting. And it seems to partially still do, since the error message indicates it knows it should be looking for a B!
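Concretely, the workaround looked roughly like this (a sketch, not verbatim from my grammar; the exact SCAN placement/syntax here is my reconstruction):

```
FindABC :
    (<A>)? ( SCAN TempB | TempC )
;
```

Forcing an explicit scan makes the generator emit a full lookahead routine for that alternative, which performs the lexical state switch before deciding between the choices.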

The other odd thing is that while searching for the A it goes all the way to the end of the input, instead of giving up right at the first non-a. That seems somewhat inefficient, though for my use-case it's irrelevant, since my inputs are quite small.

    hylkevds
    Hi, thanks for the report. Sorry to be a bit slow in answering. I honestly hadn't seen it until just now.

    Hmm, it seems that the problem is fairly straightforward. Basically, the generated code is using single-token lookahead logic when it should be generating a predicate method that would presumably do the lexical state switch as part of the lookahead.

    Possibly one solution (though it's a bit of a hack admittedly) would be to trick the thing into generating a predicate method. Thus, it might be that if you rewrite the production:

      TempB : 
           FindB =>||
      ;

    then I think it will generate a predicate method and that predicate method should include the lexical state switching.

    Another thing that I am wondering about is whether the problem is present if, for example, you write:

      FindB : 
           LEXICAL_STATE LS_B (<B>(<B>)*)
      ;

    I could check these things myself, and I eventually will, I suppose. But it could be useful in terms of narrowing in on where the issue is.

    As for it scanning forward to the end of the input, I suppose it's doing that because it's trying to capture all the invalid characters. But I don't suppose there is much of an efficiency issue since it is only doing that when it hits an error anyway.

    No worry about quick replies, I worked around it with a SCAN.

    Using your suggestion with =>|| also fixes the problem.
    Changing the content of FindB to LEXICAL_STATE LS_B (<B>(<B>)*) does not fix it.

    The "Scanning to the end" happens in the if (nextTokenType() == A) { call, so it also happens when parsing is successful. Though it doesn't exactly scan to the end, it scans to the first SKIP character in this test case. I guess it scans to the first character that is valid is the current lexical state. I guess that in an unfortunately designed parser that may cause quite some overhead. Maybe limiting the invalidChars search in tokenizeAt to 2 or 3 is in order?

      hylkevds No worry about quick replies, I worked around it with a SCAN.

      That shouldn't be necessary any more, at least against the latest build. It should be fixed. You can check out the code and build, of course, but here is a link to a prebuilt jar.

      The "Scanning to the end" happens in the if (nextTokenType() == A) { call, so it also happens when parsing is successful.

      Ah, yes, I see. I was only thinking about the case when you are actually parsing, not in a lookahead. It is possible that if you lookahead in that case, and it rolls up an InvalidToken for all the characters that it can't lex in that lexical state, then it could be scanning forward a lot of characters! Probably not a very common situation, but it is conceivable. And it is true that, in terms of a lookahead failing, all it needs is to find one invalid character and the lookahead can fail, so it is potentially inefficient. Well, it's something to bear in mind, but it is probably not such a common problem.

      Anyway, do let me know if there is any issue with the latest version.

      9 days later

      I think there might be a problem with your solution in the latest revision, or I have a "scaffolding" bug somewhere that depends on the previous behavior. My parser works with it on 99.5% of my regression tests, but fails in a variety of places in the other .5%. On initial examination it seems to be causing a large number of additional predicate methods to be generated, and at least the few I've looked at don't seem to have any dependency on a lexical state change. I will investigate further.

        An update on this. I take back the part about no lexical state change dependency. I think they all do, but sometimes it is way down in the parser expansion tree below where the former simple token "contains" turned into a scan. There are around 50 more non-terminals in my parser that return a positive indication of starting with a lexical state change than formerly were there. That leads to a lot more scans, obviously. I note that of the example parsers, this change only produces 1 Python non-terminal that differs in that way, and about this many for CSharp:

        non-terminals that return differing "startsWithLexicalChange" result: [NonNullableType, InListPattern, SwitchExpression, StatementExpression, NonTupleType, ConstantDeclaration, UnaryExpression, RangeExpression, WithExpression, AndPattern, ArrayType, ElementInitializer, AndExpression, AssignmentExpression, ExplicitAnonymousFunctionSignature, CompilationUnitBody, ConditionalAndExpression, MultiNonInterpolatedText, NonArrayType, EqualityExpression, MultiplicativeExpression, FormalParameters, AdditiveExpression, RelationalExpression, InclusiveOrExpression, ConditionalExpression, ExclusiveOrExpression, NullConditionalExpression, NullCoalescingExpression, ShiftExpression, ExplicitAnonymousFunctionParameter, SubPatterns, ArgumentList, MemberDeclarator, InterfaceMemberDeclaration, EmbeddedStatement, MemberInitializer, Expression, ExpressionStatement, OrPattern, ClassMemberDeclaration, NotPattern, InterfaceIndexerDeclaration, VariableInitializer, ConditionalOrExpression, PrimaryPattern]

        I still don't know for sure why my parser is broken, presumably as a result of a predicate replacing a simple next token check. More as I delve further into this.

          adMartem I still don't know for sure why my parser is broken, presumably as a result of a predicate replacing a simple next token check. More as I delve further into this.

          Well, that is what I find mysterious. If it is generating a lot more predicate routines (probably in many cases where it was not strictly necessary), it doesn't seem like that should cause any test failures -- or, really, that it should change the behavior of the generated parser in any way.

          The only explanation I can think of is that your parser now matches at a point where it previously did not match.
          In my example:

          (<A>)? (TempB | TempC)

          It used to fail because it did not generate a scan for TempB, and the input contained nothing that could be matched by the lexical state of A. But what if, in my example, TempC had the same lexical state as A and would have matched the input? Then it would have gone into TempC even though it should have preferred TempB.

          Here is a (very contrived) example:

          <DEFAULT,LS_A,LS_B> SKIP: " ";
          
          <DEFAULT> TOKEN:
              <UNUSED: "unused">
          ;
          <LS_A> TOKEN:
              <A: "a" >
          ;
          <LS_B> TOKEN:
              <B: "b" >
          ;
          
          Start : LS_A :
              FindABC <EOF>
          ;
          FindABC :
              (<A>)? (TempBB)? FindB
          ;
          FindA :
              <A>
          ;
          TempBB :
            FindBB
          ;
          FindBB : LS_B :
              <B><B>
          ;
          
          TempB :
            FindB
          ;
          FindB : LS_B :
              <B>(<B>)*
          ;

          The intention is that the first two b's are matched separately by FindBB, while the rest of the b's are matched by FindB.

          The old version would never match the first two b's together:

          Dumping the AST...
          <Start (1, 1)-(1, 8)>
            <FindB (1, 1)-(1, 8)>
              b
              b
              b
              b
              b
              b
              b
            EOF

          The new one does:

          Dumping the AST...
          <Start (1, 1)-(1, 8)>
            <FindABC (1, 1)-(1, 8)>
              <FindBB (1, 1)-(1, 2)>
                b
                b
              <FindB (1, 3)-(1, 8)>
                b
                b
                b
                b
                b
            EOF

            12 days later

            adMartem I think there might be a problem with your solution in the latest revision

            Yeah, actually, there must be. I finally noticed that the Java parser was (a bit) broken by that revision (I mean this commit). Prior to that commit, the generated Java parser (the one in examples/java that, as you doubtless know, is the Java parser used internally in CongoCC itself) could parse all the Java source code in the JDK 21 src.zip. All 15,401 files. After that commit, it fails on 5 files. I honestly don't know why.

            I backed out the changes made in the aforementioned commit and now the Java parser is parsing all of the files again. This also means that the issue raised by the original poster is back, which is annoying, though it seems there is an easy workaround.

            I will investigate further.

            Did you figure out anything about this? For the moment, I have to admit that I am quite perplexed by this. Could you confirm that the 0.5% test failures that you mention above are now succeeding?

              hylkevds I spoke too soon about having fixed this issue. I had to revert the commit with those changes, and it seems that the issue you raised is back, so you would need to apply the workaround that you mentioned (an explicit SCAN statement, I guess).

              I have to admit that, at this precise moment, I don't really understand what is going on with this. I really would like to get to the bottom of this, but it may take a little while....

              7 days later

              Sorry for the tardy reply, Jon. I see that your recent changes affect this behavior, so I will reverse my temporary reversal of your change and see what happens to my .5%.

              revusky Well, I don't think they fix my 0.5%, but it is possible they might have changed the failures. But from my sampling of them they appear to be unchanged. This is the change I can make that makes everything work again.

              The manifestation of the errors is somewhat perplexing, as some seem to affect productions that should not be sensitive to lexical state transitions, but they all stem from programs that use a COBOL declaration making "," the decimal point rather than the default ".". This is effected by a lexical state change on the BNFProduction for decimal numeric literals, and all of the affected COBOL syntax errors are in statements containing decimal numeric literals, so I suspect that is the connection. I've tried to reproduce the behavior with a simple test case, but so far I've not been successful. Maybe the preceding will trigger some thought for you, however.

                adMartem I think I see the problem. Will provide a fix shortly. I spoke too soon. I thought the problem was that BNFProduction is a subclass of Expansion but didn't override the Expansion version of startsWithLexicalChange, thereby causing its potential true value (when lexicalState is non-null) to be missed. But that does not appear to fix my parser.

                  adMartem I think the problem is that in a BNFProduction with a lexical state specified, the ExpansionSequence(s) comprising the production's ExpansionChoice child check only their respective child nodes, and not the parent of the ExpansionChoice itself, which could, in fact, be the state change of a BNFProduction. So even if the BNFProduction correctly reports startsWithLexicalChange, it is not considered when the ExpansionChoice provides its result.

                  I added code to override startsWithLexicalChange in BNFProduction (which is an Expansion, and it seemed like it should not return false in the case of a production-level state change), but that didn't work. My hypothesis above would explain that, I think, because the logic in ExpansionChoice doesn't check its parent (which could be a production). If true, however, I'm not sure putting the check for a BNFProduction parent in the ExpansionChoice version of the method is any better than the check that used to be in ExpansionSequence.

                  Furthermore, what confuses me about the check in ExpansionSequence is how it used to work at all, considering that the parent of the ExpansionSequence would in this case be an ExpansionChoice, not a BNFProduction, wouldn't it? I am perplexed. I guess I will try adding a check in ExpansionChoice and see if my hypothesis is correct and, if so, the question is whether the old check in ExpansionSequence is better than the rabbit hole I'm digging to "correctly" fix it. I'm inclined to say "no": the old check should be put back, with a comment as to why it is there for this special case.

                  2 months later

                  Just as a status report, this problem still exists for my parser, and I have to apply the small reversion above to make it run correctly. I still haven't been able to come up with a test case that reproduces it.

                    hylkevds
                    Hi, I turned my attention back to this issue you raised and I think it really is fixed now. You see, I had applied a fix before but I later realized that it broke some other things. Specifically, the generated Java parser was now failing on a handful (I think five) files in the src.zip. And also there were the issues that John Bradley (adMartem) was reporting. So, as I said in a separate note, I ended up reverting the change that fixed your issue.

                    The truth is that I never really figured out why my previous fix broke other things, but I finally just identified where the problem must be, tore up that code snippet, and rewrote it a bit differently, and it seems to be okay now. Your issue seems to be fixed (again) and the generated Java parser is parsing everything in the JDK src.zip (the latest one, JDK 22 now). I'll be happy if John reports back that everything seems okay now with his issues.

                    I think that, for the most part, the bug you were hitting had an easy workaround. You see, there is a bit of tricky logic where the code generation tries to get away with just a single-token lookahead vs. generating a predicate method. And the single-token lookahead was spurious in the cases where you had to switch lexical states and the logic on that wasn't quite right. But, as I think you realized, you could always force it to generate a lookahead routine by simply putting in an explicit SCAN or up-to-here in the key spot.

                    But I think it's fixed now and it should work without any tricks. But, by all means report back on it.

                    Thanks.

                    adMartem Just as a status report, this problem still exists for my parser, and I have to apply the small reversion above to make it run correctly. I still haven't been able to come up with a test case that reproduces it.

                    I just applied (or re-applied) the above-referenced snippet. It looks fine and, to be honest, if that was there before, I don't know why I removed it. Maybe I was experimenting with different things and I removed that and then forgot to put it back in. I'm not really sure. I rewrote a bit of code in NonTerminal.java and, as I explained to hylkevds, it all seems to be working now, but I honestly am not sure why. I finally just deleted a certain key bit and rewrote it and somehow that got the gremlins out, it seems...

                    I will be quite curious to know what issues you now have with this. (I hope none!) This business of handling lexical state switches correctly is a tad tricky, it seems. Of course, the legacy JavaCC tool simply does not address this problem at all, so....

                      revusky
                      I can confirm that my issue appears to be fixed by your latest commit. Thanks! I will let you know if there is any contrary evidence when I run the entire COBOL regression tests, but I don't expect any. Just to be clear, this means that the unmodified main tip of CongoCC will be used for the COBOL release for Java 17, 21, and 22 I am currently working on and will be tested on those platforms and on MacOS, Linux, and Windows.

                        adMartem Just to be clear, this means that the unmodified main tip of CongoCC will be used for the COBOL release for Java 17, 21, and 22 I am currently working on and will be tested on those platforms and on MacOS, Linux, and Windows.

                        I'm thinking that we should really put up a testimonials page. You (or people generally, really) could contribute some testimonial about how they are using CongoCC and saving all kinds of time (and money) implementing whatever it is... AND... most importantly of all, having fun!!!

                        I mean, I'm guessing that it's not top secret that you're using this and you wouldn't mind writing some positive note about it. Though not just you... whoever is willing to put in a few words, of course.