adMartem I rejected simply saving and restoring the lexical state because,
Yeah, actually, I think you're right to reject it. After sleeping a bit, I realized that it is not really correct to save/restore the lexical state when you activate/deactivate tokens. I mean, there is some possibility that the code inside that expansion could intentionally change the lexical state and...
But I think there is a more general problem. Let's see... First of all, back before I introduced these various things like LEXICAL_STATE SOME_STATE (....)
or just declaring it as part of the production:
SomeProduction : SOME_LEXICAL_STATE :
etc etc
;
Before that existed, you would specify a lexical state change in your lexical grammar.
TOKEN : { <TRIGGER_TOKEN : "..."> : NEW_LEXICAL_STATE }
So once it sees a TRIGGER_TOKEN, it goes to the NEW_LEXICAL_STATE. But then, almost always, you want to later switch back to the other lexical state, DEFAULT, and that is your responsibility. And that tends to be extremely error-prone.
So, let's say you have the typical case of embedded code:
JAVA_START
some Java code
JAVA_END
When the lexical machinery sees JAVA_START, it goes into the JAVA_STATE lexical state, and then it parses Java code until it sees JAVA_END:
TOKEN : {<JAVA_START : "JAVA_START"> : JAVA_STATE}
and then elsewhere:
TOKEN : {<JAVA_END : "JAVA_END"> : DEFAULT}
EmbeddedJavaCode :
<JAVA_START>
(JavaStatement)*
<JAVA_END>
;
Okay, so to be absolutely fair, the above disposition could work in certain circumstances... BUT it does not work generally. Why? Because it's not compatible with how lookahead works. If you're parsing, you read in the JAVA_START token, you switch to the JAVA_STATE lexical state, you eventually reach the JAVA_END token, and you go back to the DEFAULT state. The problem is that lookahead does not work like that. Lookahead just scans ahead until it hits a problem and then jumps out. So, if this is part of a lookahead, it is going to fail without reaching JAVA_END, and if you're relying on that as your mechanism to go back to the DEFAULT state, it ain't gonna work.
And this is a general problem. This kind of thing does not work in conjunction with lookahead, for the simple reason that lookahead typically jumps out before it reaches the closing trigger token. That holds even when the lookahead is successful: with an explicit numerical lookahead, it jumps out (and this is a successful lookahead) because it scanned ahead the n tokens it was supposed to, and that's it. Actually, come to think of it, a lookahead jumps out far more often than it scans to the very end!
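To make that failure mode concrete, here is a toy model in Java. It is only a sketch and not the generated parser code: ToyLexer, scanAhead and the token strings are hypothetical names made up for illustration, and a real lookahead would also rewind its position afterwards. The only thing modelled is that the lexical state changes when a trigger token is consumed, and that a numeric lookahead of 2 stops long before JAVA_END.

```java
import java.util.List;

public class StuckStateDemo {
    // Toy lexer (hypothetical): the state changes only when a trigger token
    // is consumed, mirroring the old-style "<JAVA_START ...> : JAVA_STATE".
    static class ToyLexer {
        String state = "DEFAULT";
        private final List<String> tokens;
        private int pos = 0;

        ToyLexer(List<String> tokens) {
            this.tokens = tokens;
        }

        String next() {
            String tok = tokens.get(pos++);
            if (tok.equals("JAVA_START")) state = "JAVA_STATE";
            if (tok.equals("JAVA_END")) state = "DEFAULT";
            return tok;
        }
    }

    // A numeric lookahead: scan n tokens, then stop.
    // It "succeeds" without ever reaching JAVA_END.
    static boolean scanAhead(ToyLexer lexer, int n) {
        for (int i = 0; i < n; i++) {
            lexer.next();
        }
        return true;
    }

    public static void main(String[] args) {
        ToyLexer lexer = new ToyLexer(
                List.of("JAVA_START", "int", "x", ";", "JAVA_END"));
        boolean ok = scanAhead(lexer, 2);
        // The lookahead succeeded, but nothing switched the state back.
        System.out.println(ok + " " + lexer.state); // prints "true JAVA_STATE"
    }
}
```

The lookahead did exactly what it was asked to do, and the lexer is still sitting in JAVA_STATE afterwards, because the only thing that would have reset it is a token that was never reached.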
Well, the bottom line is that, in order to work reliably, the generated code pretty much has to be a try-finally construct, generating something like:
<store current lexical state>
<switch to new lexical state>
try {
succeed OR fail at parsing/scanning the expansion
} finally {
<restore previous lexical state>
}
And actually, that's what try-finally is for! If you jump out early (either by regular control flow or by just throwing an exception) there is an iron-clad guarantee that the finally block is executed. (And every nested finally block inside is executed as well, in the appropriate innermost to outermost order!)
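Here is a small runnable Java sketch of that guarantee. It is not what the tool actually generates; inState, the state names and the log are hypothetical, invented for this example. It nests two scoped state switches and then bails out of the innermost one with an exception, which stands in for a scan that jumps out early.

```java
import java.util.ArrayList;
import java.util.List;

public class ScopedStateDemo {
    static String state = "DEFAULT";
    static final List<String> log = new ArrayList<>();

    // Hypothetical helper: run body in newState, and always restore the
    // previous state, whether body returns normally or throws.
    static void inState(String newState, Runnable body) {
        String previous = state;
        state = newState;
        try {
            body.run();
        } finally {
            log.add("restore " + previous + " (leaving " + newState + ")");
            state = previous;
        }
    }

    public static void main(String[] args) {
        try {
            inState("JAVA_STATE", () ->
                inState("STRING_STATE", () -> {
                    // Simulate a scan that jumps out early.
                    throw new IllegalStateException("lookahead failed");
                }));
        } catch (IllegalStateException e) {
            log.add("caught: " + e.getMessage());
        }
        log.forEach(System.out::println);
        System.out.println(state);
        // prints:
        // restore JAVA_STATE (leaving STRING_STATE)
        // restore DEFAULT (leaving JAVA_STATE)
        // caught: lookahead failed
        // DEFAULT
    }
}
```

Both finally blocks run, innermost first, before the exception ever reaches the catch, so by the time anybody outside can observe it, the state is back to DEFAULT.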
And, you know, modulo whatever initial implementation glitches, this approach really should work robustly, both in parsing and lookahead. Now, a funny thing is that, when I later implemented the token activation/deactivation, on the first pass, I just implemented it naively. I just had something like:
ACTIVATE_TOKENS (FOO, BAR)
And then, presumably, it was on you to deactivate them manually later, with:
DEACTIVATE_TOKENS (FOO, BAR)
I later realized that this suffered from much the same problem as switching lexical states did, and really, could only work in a robust manner if it was bounded. I mean like:
ACTIVATE_TOKENS FOO, BAR ( some expansion )
And then, as in the case of the lexical state change, we should have the iron-clad guarantee that the set of active token types is the same before/after since, again, it gets translated into a try-finally in which the finally block restores it to the previous state. And again, it doesn't matter how many of these things are nested within one another, because when the whole execution stack gets unrolled, all the finally blocks get invoked.
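The same idea for token activation, again as a hedged Java sketch rather than the real generated code. The TokenType enum, the active set and withActivated are all assumptions made up for illustration; the point is only that the set of active token types is snapshotted on entry and put back in a finally block, however the bounded expansion exits and however deeply these are nested.

```java
import java.util.EnumSet;
import java.util.Set;

public class ScopedActivationDemo {
    enum TokenType { FOO, BAR, BAZ }

    // Hypothetical "currently active token types" set.
    static EnumSet<TokenType> active = EnumSet.of(TokenType.BAZ);

    // Hypothetical helper modelling: ACTIVATE_TOKENS FOO, BAR ( expansion )
    static void withActivated(Set<TokenType> types, Runnable expansion) {
        EnumSet<TokenType> saved = EnumSet.copyOf(active); // snapshot on entry
        active.addAll(types);
        try {
            expansion.run();
        } finally {
            active = saved; // restore the snapshot on any exit
        }
    }

    public static void main(String[] args) {
        EnumSet<TokenType> before = EnumSet.copyOf(active);
        try {
            withActivated(EnumSet.of(TokenType.FOO, TokenType.BAR), () -> {
                // A nested activation that exits normally.
                withActivated(EnumSet.of(TokenType.FOO), () -> { });
                System.out.println(active); // prints "[FOO, BAR, BAZ]"
                // Then the enclosing expansion bails out early.
                throw new IllegalStateException("scan failed");
            });
        } catch (IllegalStateException e) {
            // The failure itself is not the interesting part here.
        }
        System.out.println(active.equals(before)); // prints "true"
    }
}
```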
So, what is my point here? I think your problem fundamentally may be that you are mixing old-style lexical state switching with the new style. If you could refactor your code to exclusively use the new style of lexical switching, then most likely your problems would go away.
Am I wrong about that? (I could be, but...)