LexicalState is possibly misapplied after ACTIVATE_TOKENS

I've run into the following behavior, and I'm not sure if it is normal or a bug. If you are scanning and match a token that includes a lexical state change, the change takes effect on the next token scanned (as expected). If the grammar then uses ACTIVATE_TOKENS to activate a new token in one of its rules, the parser saves the current active token set (as expected), adds the new token(s), and throws away all of the tokens following the currentLookaheadToken (as expected). It then scans with the new token set, beginning at the offset following the currentLookaheadToken (as expected). However, it does so in the lexical state set by the now-discarded token, re-tokenizing one or more tokens in, effectively, the wrong lexical state. The principle of least surprise would seem to dictate that this should not happen; the parser should probably do something like what the LEXICAL_STATE directive does, provided someone (i.e., the Lexer) keeps track of the lexical state following each parsed or scanned token.
So my question is: is this a bug, or should the user of the ACTIVATE_TOKENS directive explicitly set the LEXICAL_STATE to what it should be following the last scanned or parsed token? The latter seems to me rather messy at best, and maybe impossible, since it requires the author to know more about the parsing mechanism than he should.

For my particular case, of course, I think I can use an ASSERT in the expansion that uses the activated token to check the lexical state and fail the lookahead if it is not one the token belongs to.


    adMartem

    Gee, I'm not sure I understand the issue. I would probably need some sort of example that illustrates this. (I'm not saying that there's no problem with this, there quite possibly is, but I would need a concrete example.)

    Offhand, it is hard to understand the issue because the whole question of activating/deactivating token types is basically orthogonal to the lexical state you are in. Let's see... what I mean by that is that when you activate/deactivate a token type at some juncture, there is no real interaction there with what lexical state you happen to be in. Or, to put it another way, there is no sanity check that this token type you are activating or deactivating is even part of that lexical state. Any lexical state is a state machine that reads in character input and spits out tokens. There are some token types that it will never spit out because they simply don't exist in that lexical state, so effectively activating/deactivating a token type that is not part of a given lexical state just has no effect.

    So, it's this NFA algorithm (of which I never found a decent, readable explanation anywhere) that I finally re-implemented (which is how I finally came to understand the damn thing). It keeps reading in characters one by one until it can't any more. (Longest match wins.) At any iteration of the loop, there is an input character (obviously) and a set of NFA states that are active. An NFA state will either (a) do nothing, (b) activate one or more NFA states for the next iteration, or (c) match a given token type. (It can do both (b) and (c), actually.) Now, all that token activation/deactivation introduces to the mix is that it will only do (c) if the token type is in the set of active types. So, to get very concrete about things, here, basically, is the functional interface that defines an NFA state:

       // Given the current input character, an NFA state sets the bits of the
       // NFA states to activate on the next iteration and returns the token type
       // it matches, if any -- but only if that type is in the set of valid types.
       static interface NfaFunction {
          TokenType apply(int ch, BitSet bs, EnumSet<TokenType> validTypes);
       }

    So, when I first implemented this big refactoring, this apply method had only 2 parameters: the input character ch and the BitSet, which is a variable used to keep track of which NFA states are going to be valid on the next iteration. (The loop will stop on the next iteration if that BitSet is empty.) Then it occurred to me somewhat later that there was some really low-hanging fruit: I could add a third parameter, the validTypes, and when we match a token type, we just do a quick (and it's very quick) check to see whether that type is in the set of valid types. So you see this here and here

    And in the main lexer loop, it now just passes in that activeTypes as the third parameter. See here.
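
    To make the shape of that loop concrete, here is a minimal, self-contained sketch in Java. It is illustrative only: names like nfaFunctions and activeTypes are stand-ins for the generated artifacts, not the literal generated code.

       import java.io.IOException;
       import java.io.Reader;
       import java.util.BitSet;
       import java.util.EnumSet;

       // Sketch of the longest-match NFA loop with the active-type check folded in.
       class NfaLoopSketch {
          enum TokenType {FOO, BAR}

          interface NfaFunction {
             TokenType apply(int ch, BitSet nextStates, EnumSet<TokenType> validTypes);
          }

          NfaFunction[] nfaFunctions; // one function per NFA state
          EnumSet<TokenType> activeTypes = EnumSet.allOf(TokenType.class);

          TokenType nextTokenType(Reader input) throws IOException {
             TokenType matched = null; // the last (i.e. longest) match seen so far
             BitSet current = new BitSet(), next = new BitSet();
             current.set(0); // the initial NFA state of the current lexical state
             int ch;
             while ((ch = input.read()) != -1) {
                next.clear();
                for (int i = current.nextSetBit(0); i >= 0; i = current.nextSetBit(i + 1)) {
                   // A state may activate successors (via next) and/or match a type,
                   // but the match only counts if that type is currently active.
                   TokenType type = nfaFunctions[i].apply(ch, next, activeTypes);
                   if (type != null) matched = type; // longest match wins: keep overwriting
                }
                if (next.isEmpty()) break; // no live states: stop reading
                BitSet tmp = current; current = next; next = tmp;
             }
             return matched; // the caller rolls the input back to the end of the match
          }
       }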

    Well, I just outline all this in such excruciating detail out of a certain pride of craftsmanship, but also in the hopes that you (or anybody) can incrementally start seeing how the code works. But, in the context here, there is a point, which is that the activation/deactivation of token types is so simple really that it's actually kind of hard for any bugs to live there. But hopefully, you see what I mean when I say that token activation/deactivation and lexical states are actually orthogonal. The token activation/deactivation machinery has no "knowledge" of which tokens are part of which states. And really, it doesn't need to...

    Well, some bugs did live there (and maybe still do), but those would surely relate to token caching. If the whole thing were really bloody-minded and never cached tokens, it would be hard to see where anything could go wrong. (Except that things would be massively slower, because you'd be constantly re-tokenizing the same input.) The real problem is caching tokens that you created on one lookahead path and then re-using them (spuriously) on some other path. This, by the way, was one outrageous longstanding bug in the legacy JavaCC. You could scan forward on some lookahead in whatever lexical state, and then, if your lookahead failed and you really needed to throw away all those cached tokens because you're in a different lexical state, guess what... it wouldn't do it. So, generally speaking, there was a sort of impedance mismatch between using lexical states and lookaheads that, I believe, is basically solved in JavaCC 21. (Sorry to be so boastful...)

    Well, I think there are two key points where this whole problem of cached tokens is dealt with, here and here

    Well, anyway, those are the key points to look at if you really want to narrow in on whatever the issue is. In other matters, I fixed the nested lookahead problem that I described here. I'll have to write up an explanation of that. Also, I need to get Vinay to adjust the Python/CSharp templates to reflect this. Maybe I could do it myself, but I think it would be better if he did it, so as to maintain/enhance his understanding of the machinery. But I need to write up an explanation of the under-the-hood changes.

    I think the easiest way to describe it is to use this commit https://github.com/javacc21/javacc21/compare/master...adMartem:javacc21:master
    which I used to fix the problem for my testing. I'm not sure it is the right solution, nor the complete solution, but I think it clarifies the problem being solved.

    And, thanks for the NFA description.

    As you can see, the problem is not with the newly activated token, but rather with the re-tokenizing leading up to the activation.

      In my case, the following was being parsed:

      PROGRAM-ID. test-java-block.
             WORKING-STORAGE SECTION.
             77 x pic 9.
             PROCEDURE DIVISION.
             1.
                 display x
                 java
                     assert (1 == 1); 
                 end-java.           
             2.
                 move 1 to x x x
                 java
                     assert (1 == 1); 
                 end-java.           
             3.
                 continue
                 java
                     assert (1 == 1); 
                 end-java.                      
             exit-test.
                 stop run.

      In paragraph 1, the display statement can take two forms, one of which includes a production like this:
      At : ACTIVATE_TOKENS AT (<AT>);
      However, the java ... end-java is parsed in the JAVA_STATE, entered via a lexical state action on the java token. So what happens is: the java token is recognized, the state is changed, and the At production is visited to see if the display statement has more stuff. As a result, everything after the x (including whitespace) is tokenized in the JAVA_STATE, not the DEFAULT state. So when the java is re-tokenized it becomes a JAVA_IDENTIFIER token instead of a JAVA token, and hilarity ensues.
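
      For context, the state change on the java token is the classic lexer-action form, something like this sketch (my actual grammar differs in details):

        TOKEN [IGNORE_CASE] : { <JAVA : "java"> : JAVA_STATE }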

      Paragraphs 2 and 3 work fine, because there is no At production involved in their parsing.

      adMartem I think the easiest way to describe it is to use this commit https://github.com/javacc21/javacc21/compare/master...adMartem:javacc21:master
      which I used to fix the problem for my testing. I'm not sure it is the right solution, nor the complete solution, but I think it clarifies the problem being solved.

      Well, there's a basic, first-order problem here, which is that this may have solved your immediate problem, but it introduced other problems. Just type:

        ant test

      and you'll see that it now fails to parse 7 files (out of 137) in the CSharp testing. You will further see (though I may as well save you the bother of figuring it out) that all the failures are on the same construct, namely these interpolated strings, which are, quite frankly, a bear to deal with.

      The part of the CSharp grammar that is now not working is here and this is, in fact, a part that makes heavy use of lexical state switches.

      But, in any case, you see that the 7 failures are all on this construct. For example, here

      I can't tell you offhand why your patch caused this part of the CSharp grammar to stop working, but it's bound to be instructive to figure out why! (The above is the only test breakage, I think. Everything else seems to work.)

      adMartem

      I was wondering (not sure at all) whether simply saving/restoring the lexical state in the case of token activation/deactivation would address your problem.

      See here

      The funny thing is that I wrote the HandleLexicalStateChange macro to deal with lexical state changes (as the name implies) in order to resolve that longstanding issue. I did not implement the token activation/deactivation until quite a bit later, at which point I patched the macro, adding the extra case of handling whether the expansion had token activation. At that point, it didn't occur to me that there was any need to save/restore the lexical state in this case. But it certainly doesn't hurt. All tests are passing, and it doesn't really cost anything, so...

      Looking at your changes, I understand the thinking behind them. At a certain point, I recall that I even had some experimental branch where tokens knew which lexical state they came from, so, like you, I had an extra lexicalState field in the Token template. I was thinking in terms of rolling back to a prior state, and it seemed to me that to handle this, tokens should "know" what lexical state they came from. Or it would be useful if they did... At a later point, I rejected the idea and tore out that code. In the end, it seemed a bit heavy-handed to me, and the notion that tokens stored this information about which lexical state they came from (i.e. which NFA state machine spawned them) seemed to give off a funny smell. Though maybe only slightly. It's not impossible that I would go back to that idea.

      I don't understand why your patch caused the interpolated C# strings to break. But I won't investigate that right now. I'm going to try to get some sleep instead. It's much earlier where you are. Maybe you'll figure it out.

        revusky I was wondering (not sure at all) whether simply saving/restoring the lexical state in the case of token activation/deactivation would address your problem.

        I rejected simply saving and restoring the lexical state because, by the time the state is saved, it has already been changed to the JAVA_STATE. In my case, the very token that changed the state is the token being examined against the newly determined activeTokens (and of course it is not found there). I ended up putting the state in the Token because I was afraid to make any assumptions about how far back the lexer.reset(Token[, LexicalState]) methods might be asked to go (possibly back over multiple state changes?). It seemed only correct that the proper lexical state should be in play no matter how far back it was asked to go.
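
        In outline, the idea behind my patch looks something like this (an illustrative sketch, not the actual template change; LexicalState and TokenType here stand in for the generated enums):

           // Each Token remembers the lexical state it was matched in, so that
           // lexer.reset(token) can restore that state no matter how far back it goes.
           class Token {
              enum LexicalState {DEFAULT, JAVA_STATE}
              enum TokenType {JAVA, JAVA_IDENTIFIER}

              TokenType type;
              LexicalState lexicalState; // recorded by the lexer when the token is matched

              // On reset(Token t), the lexer would then do something like
              //    switchTo(t.lexicalState);
              // before re-tokenizing from the offset following t.
           }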

        I too noticed the CSharp failures, and ignored them, since at the time I was just trying to prove my hypothesis regarding the cause of the crazy tokens I was seeing. It's heavy-handed too, as you mentioned. When I saw your reply, I was in the process of determining whether my change had broken any of my tests. After I've determined that, I'll take a look at the C# failures too.

          OK, I'm stumped as to the C# failures. I can't see how my changes would affect anything but (DE)ACTIVATE_TOKENS, and the only times it uses those are miles away from where it fails. The horrible thought is that my changes affect the javacc parser which in turn affects the C# parser, but my cursory look at that also couldn't find any likely candidates.

          I'm sleepy now too, so I'll check tomorrow.

            adMartem I rejected simply saving and restoring the lexical state because,

            Yeah, actually, I think you're right to reject it. After sleeping a bit, I realized that it is not really correct to save/restore the lexical state when you activate/deactivate tokens. I mean, there is some possibility that the code inside that expansion could intentionally change the lexical state and...

            But I think there is a more general problem. Let's see... First of all, back before I introduced these various things like LEXICAL_STATE SOME_STATE (....) or just declaring it as part of the production:

              SomeProduction : SOME_LEXICAL_STATE :
                  etc etc
              ;

            Before that existed, you would specify a lexical state change in your lexical grammar.

              TOKEN :  { <TRIGGER_TOKEN : "..."> : NEW_LEXICAL_STATE }

            So once it sees a TRIGGER_TOKEN, it goes to the NEW_LEXICAL_STATE. But then, almost always, you want to later switch back to the other lexical state, DEFAULT, and that is your responsibility. And that tends to be extremely error-prone.

            So, let's say you have typically your embedded code.

               JAVA_START
                  some Java code
               JAVA_END

            When the lexical machinery sees JAVA_START, it goes into the JAVA_STATE lexical state, and then it parses Java code until it sees JAVA_END

               TOKEN : {<JAVA_START : "JAVA_START"> : JAVA_STATE}

            and then elsewhere:

               TOKEN : {<JAVA_END : "JAVA_END"> : DEFAULT}
            
               EmbeddedJavaCode : 
                     <JAVA_START>
                       (JavaStatement)*
                    <JAVA_END>
               ;

            Okay, so to be absolutely fair, the above disposition could work in certain circumstances... BUT it does not work generally. Why? Because it's not compatible with how lookahead works. If you're parsing, you read in the JAVA_START token, you switch to the Java state, you eventually reach the JAVA_END token, and you go back to the DEFAULT state. The problem is that lookahead does not work like that. Lookahead just scans ahead until it hits a problem and then jumps out. So, if this is part of a lookahead, it is going to fail without reaching JAVA_END, and if you're relying on that as your mechanism for going back to the DEFAULT state, it ain't gonna work. And this is a general problem: this kind of thing does not work in conjunction with lookahead, for the simple reason that lookahead typically jumps out before it reaches the closing trigger token. Even a successful lookahead can jump out early: an explicit numerical lookahead jumps out (successfully) because it has scanned ahead the n tokens it was supposed to, and that's it. Actually, come to think of it, a lookahead jumps out early far more often than it scans to the very end!

            Well, the bottom line is that, in order to work reliably, the generated code pretty much has to be a try-finally construct, generating something like:

              <store current lexical state>
              <switch to new lexical state>
              try {
                 succeed OR fail at parsing/scanning the expansion
              } finally {
                 <restore previous lexical state>
              }

            And actually, that's what try-finally is for! If you jump out early (either by regular control flow or by just throwing an exception) there is an iron-clad guarantee that the finally block is executed. (And every nested finally block inside is executed as well, in the appropriate innermost to outermost order!)
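
            As a minimal self-contained sketch (assuming a hypothetical lexer with a switchTo method; this shows the shape of the thing, not the actual generated code):

               // Sketch of the try-finally discipline for a bounded lexical state change.
               class LexicalSwitchSketch {
                  enum LexicalState {DEFAULT, JAVA_STATE}

                  static class Lexer {
                     LexicalState lexicalState = LexicalState.DEFAULT;
                     void switchTo(LexicalState state) {lexicalState = state;}
                  }

                  static boolean parseEmbeddedJava(Lexer tokenSource) {
                     LexicalState previous = tokenSource.lexicalState; // store the current state
                     tokenSource.switchTo(LexicalState.JAVA_STATE);    // switch to the new state
                     try {
                        // succeed OR fail (possibly by throwing) at parsing/scanning the expansion
                        return scanJavaExpansion(tokenSource);
                     } finally {
                        tokenSource.switchTo(previous); // iron-clad restore, however we exit
                     }
                  }

                  static boolean scanJavaExpansion(Lexer lexer) {return true;} // placeholder
               }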

            And, you know, modulo whatever initial implementation glitches, this approach really should work robustly, both in parsing and lookahead. Now, a funny thing is that, when I later implemented the token activation/deactivation, on the first pass, I just implemented it naively. I just had something like:

             ACTIVATE_TOKENS (FOO, BAR)

            And then, presumably, it was on you to deactivate them manually later, with:

             DEACTIVATE_TOKENS (FOO, BAR)

            I later realized that this suffered from much the same problem as switching lexical states did, and really, could only work in a robust manner if it was bounded. I mean like:

              ACTIVATE_TOKENS FOO, BAR ( some expansion )

            And then, as in the case of the lexical state change, we should have the iron-clad guarantee that the set of active token types is the same before/after since, again, it gets translated into a try-finally in which the finally block restores it to the previous state. And again, it doesn't matter how many of these things are nested within one another, because when the whole execution stack gets unrolled, all the finally blocks get invoked.
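
            Concretely, and again only as an illustrative sketch (the field name activeTokenTypes is my stand-in, not necessarily what the template uses), the bounded form translates to something like:

               import java.util.EnumSet;

               // Sketch of the translation of ACTIVATE_TOKENS FOO, BAR ( some expansion ).
               class ActivateTokensSketch {
                  enum TokenType {FOO, BAR, BAZ}

                  EnumSet<TokenType> activeTokenTypes = EnumSet.of(TokenType.BAZ);

                  void parseWithActivatedTokens() {
                     EnumSet<TokenType> previous = EnumSet.copyOf(activeTokenTypes); // save
                     activeTokenTypes.add(TokenType.FOO);
                     activeTokenTypes.add(TokenType.BAR);
                     try {
                        parseTheExpansion();
                     } finally {
                        activeTokenTypes = previous; // restored no matter how we exit; nested
                                                     // regions unwind innermost to outermost
                     }
                  }

                  void parseTheExpansion() {/* parse or scan the bounded expansion */}
               }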

            So, what is my point here? I think your problem fundamentally may be that you are mixing old-style lexical state switching with the new style. If you could refactor your code to exclusively use the new style of lexical switching, then most likely your problems would go away.

            Am I wrong about that? (I could be, but...)

              adMartem OK, I'm stumped as to the C# failures.

              I was looking at it around 2 a.m. my time and I couldn't fathom it. That interpolated string stuff in C# is really the heaviest use of this sort of new-style lexical state switching.

              It occurred to me that there is something you could find useful that I never documented anywhere. You can see it being used here

              ShiftExpression :
                 AdditiveExpression
                 [
                    // Under certain conditions we scanned forward
                    // and (incorrectly) scanned a >>, so we uncache
                    // the tokens and end up rescanning!
                   SCAN ">" (">"|">>") => UNCACHE_TOKENS
                 ]
                 (
                    ("<<" | ">>" | ">>>")
                    AdditiveExpression
                 )*
              ;

              This UNCACHE_TOKENS is an instruction I introduced because, at whatever key juncture, the thing had cached tokens and my automatic logic for when to retokenize was failing. In the above, we're in an arithmetic expression, so there is some possibility that it has scanned ahead and tokenized >> as two consecutive > tokens instead of a single >>. So we can just tell it manually to throw it away and retokenize. Or, in other words, you can finally just manually shake out any gremlins by resorting to this. And, actually UNCACHE_TOKENS is just the same as:

                {token_source.reset(getToken(0));}

              But I figured I'd make it an instruction in the grammar. Also, in principle, it would work independently of the target language. So, you can use this to just manually throw away any cached tokens after lastConsumedToken or currentLookaheadToken (if we're in a lookahead). That use in the Java grammar is actually the only place it is used in the entire project. I can't remember now whether I had to use it in some other case, but then finally restructured things so that it wasn't necessary.

              revusky Am I wrong about that? (I could be, but...)

              No, I think you are absolutely correct. My first instinct was that "I should really just change to using the new parser-directed lexical state change feature" (I already use it in several places, but I also use the lexer-based mechanism). But I didn't, for two reasons: there is a COBOL feature that switches the decimal point to a comma instead of the period, and I was trying to make the move to CongoCC with minimal non-essential, non-cosmetic changes, to reduce the effort of keeping the old JavaCC and new CongoCC parsers synchronized if I had to change the grammar for unrelated reasons. As for the "DECIMAL IS COMMA" thing: while its semantics are confined to a single production (ProgramUnit), its effects are scattered among many productions, since the ","/"." switch is only performed in some contexts. For example,

              <DEFAULT, DECIMAL_IS_COMMA> TOKEN [IGNORE_CASE] :
              {
                  < ENVIRONMENT_DIVISION:
                      <ENVIRONMENT>
                      (<SPACE_SEPARATOR> | "-")
                      <DIVISION>
                      (<SPACE_SEPARATOR>)?
                      (( <DOT> | <DOTS> ) (<SPACE_SEPARATOR>)?)+
                  > {isPastIdDivision = true;}
                  | 
                  < DATA_DIVISION:
                      <DATA>
                      (<SPACE_SEPARATOR> | "-")
                      <DIVISION>
                      (<SPACE_SEPARATOR>)?
                      (( <DOT> | <DOTS> ) (<SPACE_SEPARATOR>)?)+
                  > {isPastIdDivision = true;}
                  | 
                  < PROCEDURE_DIVISION:
                      <PROCEDURE>
                      (<SPACE_SEPARATOR> | "-")
                      <DIVISION>
                  > {isPastIdDivision = true;}
              }
              <DEFAULT> TOKEN [IGNORE_CASE] :
              {
                  < DECIMAL_POINT_IS_COMMA:
                      <DECIMAL_POINT>
                      (<SPACE_SEPARATOR>)?        
                      (
                        <IS>      
                        (<SPACE_SEPARATOR>)?
                      )?
                      <COMMA>
                  > 
                  {
                      if (isPastIdDivision) {
                          setDecimalComma(true);
                          toDecimalIsComma();
                      }
                  }
              }
              < DEFAULT > TOKEN [IGNORE_CASE] #NumericConstantToken :
              {
                  < NUMERIC_CONSTANT: (
                                          (
                                              (["+"]|["-"])?
                                              (
                                                  (["0"-"9"])+
                                                  ["."]
                                                  (["0"-"9"])+
                                                  |
                                                  ["."]
                                                  (["0"-"9"])+
                                              )
                                              (
                                                  ["E"]
                                                  (["+"]|["-"])?
                                                  (["0"-"9"])+
                                              )?
                                              |
                                              (["+"]|["-"])?
                                              (["0"-"9"])+
                                              (["."])?
                                              ["E"]
                                              (["+"]|["-"])?
                                              (["0"-"9"])+
                                              |
                                              (["+"]|["-"])
                                              (["0"-"9"])+
                                          )
                                      )  >
              }
              < DECIMAL_IS_COMMA > TOKEN [IGNORE_CASE] #NumericConstantToken :
              {< NUMERIC_CONSTANT_COMMA:
                                      (
                                          (
                                              (["+"]|["-"])?
                                              (
                                                  (["0"-"9"])+
                                                  [","]
                                                  (["0"-"9"])+
                                                  |
                                                  [","]
                                                  (["0"-"9"])+
                                              )
                                              (
                                                  ["E"]
                                                  (["+"]|["-"])?
                                                  (["0"-"9"])+
                                              )?
                                              |
                                              (["+"]|["-"])?
                                              (["0"-"9"])+
                                              ([","])?
                                              ["E"]
                                              (["+"]|["-"])?
                                              (["0"-"9"])+
                                              |
                                              (["+"]|["-"])
                                              (["0"-"9"])+
                                          )
                                      )  > 
                                     /*
                                     {
                                          if (matchedToken.specialToken == null) {
                                              if (image.charAt(0) == ',') {
                                                  if (getLastToken().getType() != LPARENCHAR) {
                                                      //reportError(ErrorCode.AMBIGUOUS_COMMA);
                                                      System.out.println("ambiguous comma");
                                                  }
                                              }
                                          }
                                     } 
                                     */
              
              }

              and things like

              < SQL_STATE > TOKEN [IGNORE_CASE] #SqlReservedWordToken : 
              {
                  <ESQL_ALL: "ALL">
              |   <ESQL_ALTER: "ALTER">
              ...
              |   <ESQL_FOUND: ("FOUND" (<ESQL_DO>)?)> { toDefault(); }
              |   <ESQL_NOTFOUND: ("NOTFOUND" (<ESQL_DO>)?)> { toDefault(); }
              }

              toDefault() returns the lexical state to either the DEFAULT or the DECIMAL_IS_COMMA state, depending on whether or not the DECIMAL IS COMMA statement was present in the COBOL source program.

              But, I agree, my problem in this area stems from the interference between the lexer changing state and the parser doing so as well. I thought my change, which upon reset always restores the lexical state to the lexer-determined state following the token, would be an innocuous way to solve the problem. However, now as I say that, I realize it might not be true. What if the token being reset to was just before a parser-determined state change, so that the tokens following it were not actually formed in the lexical state determined by the lexer? Something like the "{" that precedes the state change in the PlaceHolder production in C#. I should check whether any ACTIVATE_TOKENS or DEACTIVATE_TOKENS occurs in a lower-level lookahead. If so, that might be the cause of the failure, as the first token after the "{" would be re-lexed in the IN_REGULAR_INTERPOLATION state instead of the CSHARP state!
              But how can that happen without reprocessing the LEXICAL_STATE CSHARP logic?

                Maybe my change should have preserved the "pre" lexical state in the lexer instead of the "post" lexical state? On second thought: bad idea, I don't even know what that would mean.

                Curiouser and curiouser. Here is my way of programmatically switching state in the lexer, given that I want to return to the default state but don't statically know whether it is the DEFAULT state or the DECIMAL_IS_COMMA state:

                public void toDefault() {
                    // Set the next lexer state to either DEFAULT or DECIMAL_IS_COMMA
                    // (can be overridden by an explicit state in a rule).
                    if (isDecimalComma) {
                        switchTo(LexicalState.DECIMAL_IS_COMMA);
                    } else {
                        switchTo(LexicalState.DEFAULT);
                    }
                }

                I realize now that my "fix" will not generally work with this kind of state switch, as in my earlier example:

                ...
                |   <ESQL_NOTFOUND: ("NOTFOUND" (<ESQL_DO>)?)> 
                { toDefault(); }
                }

                  adMartem The "DECIMAL IS COMMA" thing,

                  Well, just offhand, the (possible) treatment of the comma as a decimal separator seems like a good use case for token activation/deactivation. In legacy JavaCC, one would find oneself defining a completely separate lexical state in which everything is the same except one or two tokens. One really balks at that when one knows how this is implemented! It just generates all the code for a completely separate state machine in which everything is the same except for that one token! So, given the implementation, having separate lexical states that differ by only one token produces huge code bloat. And this probably accounts for the majority of cases where people hit the "code too large" issue in the legacy JavaCC.

                  I guess it ends up looking like this:

                  SCAN 0 {doWeTurnOnCommaAsDecimal()} => ACTIVATE_TOKENS DECIMAL_IS_COMMA (ProgramUnit)
                  |
                  ProgramUnit

                  It occurs to me that maybe some refinement is in order for the ACTIVATE_TOKENS construct. It could be attractive to express activation/deactivation more easily, as in:

                     ACTIVE_TOKENS +DECIMAL_IS_COMMA -DECIMAL_IS_DOT (ProgramUnit)

                  As things stand, I guess you have to write:

                     ACTIVATE_TOKENS DECIMAL_IS_COMMA
                     (
                        DEACTIVATE_TOKENS DECIMAL_IS_DOT
                        (
                           ProgramUnit
                        )
                     )

                  Or you could write it on one line, of course, but it's overly verbose regardless. So, that's a possible refinement. Another possibility would be to allow the condition to be included in the statement, so that you could write:

                    ACTIVE_TOKENS
                       {doWeTurnOnCommaAsDecimal()}
                       +DECIMAL_IS_COMMA -DECIMAL_IS_DOT
                    (
                       ProgramUnit
                    )

                  So then we could express things more economically than what I wrote up top. How would you like that?

                  Of course, none of that adds functionality. It just streamlines the syntax. (I'm always thinking about how to streamline the syntax.)


                  adMartem but I don't statically know if it is the DEFAULT state or the DECIMAL_IS_COMMA state:

                  Well, the problem is that, generally, when you trigger the end of a lexical state, you want to revert to the lexical state ex ante -- which, in a complex grammar, might not always be the same. What you really want (in the general case) is a stack and to pop the stack to revert to the previously used lexical state. And, again, the lexical state switch specified at the parser level does this transparently.
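
                  A minimal sketch of that stack discipline (illustrative names; the generated machinery achieves the equivalent via the try-finally translation described above):

                     import java.util.ArrayDeque;
                     import java.util.Deque;

                     // Sketch of a lexical-state stack; not the actual generated lexer.
                     class LexicalStateStackSketch {
                        enum LexicalState {DEFAULT, DECIMAL_IS_COMMA, SQL_STATE}

                        private LexicalState lexicalState = LexicalState.DEFAULT;
                        private final Deque<LexicalState> stateStack = new ArrayDeque<>();

                        // On entering a bounded region, e.g. LEXICAL_STATE SQL_STATE (...):
                        void pushState(LexicalState newState) {
                           stateStack.push(lexicalState);
                           lexicalState = newState;
                        }

                        // On leaving the region (in the finally block): revert to whatever
                        // state was in effect before -- DEFAULT or DECIMAL_IS_COMMA --
                        // with no need to statically know which one it was.
                        void popState() {
                           lexicalState = stateStack.pop();
                        }
                     }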

                  And, on reflection, one starts intuiting why that corner of the C# grammar was broken, because you must have somehow been interfering with the machinery by which it would pop back to the previous lexical state, right?

                  I assume that must be the issue. I haven't looked at it any further...

                    revusky
                    I think so. There must be something, somewhere, that, when re-tokenizing (for activation or deactivation of tokens), relies on the lexical state not being the one the previous token transitioned to.