Premature lexer state change

adMartem

Another one from my Cabinet of Curiosities ...

In the grammar I have occasion to consume tokens up to a specific end token without regard to any expansion. For example something like this:

SelectStatement : <ESQL_SELECT> =>|| {consumeSqlFragment(0,END_EXEC);}# | FAIL

[BTW, the | FAIL is there to make this a choice point, so that the Java action is always performed in lookahead in the context of a higher-level up-to-here.]

What it tries to do is, at lookahead time, scan the ESQL_SELECT token and then execute the consumeSqlFragment(...) method, which is coded to scan tokens in lookahead and consume them in regular parsing. In this case it starts at the ESQL_SELECT token and scans subsequent tokens until it gets to an END_EXEC token (which it does not scan). That all seems to work fine. The wrinkle is that the END_EXEC token specifies a state change to DEFAULT (from SQL_STATE). Apparently, from looking at the NFA code, merely peeking at the END_EXEC token causes the state change to occur. Then, when actually consuming the tokens after the lookahead has succeeded, the parser fails to find the ESQL_SELECT token (because the state has been changed by the lookahead peeking at the END_EXEC) and finds the (DEFAULT state) SELECT token instead. This causes the higher-level production to fail.
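To make the suspected failure mode concrete, here is a toy model in plain Java. It is not CongoCC's generated lexer: the class, the method names, and the assumption that the first token gets re-recognized after the lookahead are all made up for illustration, based only on the symptom described above.

```java
public class EagerStateSwitchDemo {
    enum LexState { DEFAULT, SQL_STATE }

    /** Toy lexer: one token per whitespace-separated word, named by the current state. */
    static final class ToyLexer {
        private final String[] words;
        int pos = 0;
        LexState state;

        ToyLexer(String input, LexState initial) {
            this.words = input.trim().split("\\s+");
            this.state = initial;
        }

        /** Recognizes the next word. The state switch happens here, at recognition time. */
        String nextToken() {
            String word = words[pos++];
            if (state == LexState.SQL_STATE) {
                if (word.equals("END-EXEC")) {
                    state = LexState.DEFAULT; // side effect, even if the caller is only peeking
                    return "END_EXEC";
                }
                return word.equals("SELECT") ? "ESQL_SELECT" : "SQL_WORD";
            }
            return word.equals("SELECT") ? "SELECT" : "WORD";
        }
    }

    /** Lookahead pass: reads up to and including END_EXEC, then rewinds the position only. */
    static void peekUpToEndExec(ToyLexer lexer) {
        int start = lexer.pos;
        while (!lexer.nextToken().equals("END_EXEC")) {
            // keep scanning; assumes the input contains END-EXEC
        }
        lexer.pos = start; // the position is rewound, the lexical state is not
    }

    public static void main(String[] args) {
        ToyLexer lexer = new ToyLexer("SELECT a FROM t END-EXEC", LexState.SQL_STATE);
        peekUpToEndExec(lexer);
        System.out.println(lexer.state);       // DEFAULT
        System.out.println(lexer.nextToken()); // SELECT
    }
}
```

In this model the first word comes back as SELECT rather than ESQL_SELECT once the lookahead has touched END_EXEC, which is the same shape as the failure in the real grammar.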

It seems like the lexer/parser should effect the actual state change only when the token that changes it is scanned or consumed. Or should I be doing something similar to what stashParseState/popParseState does in ATTEMPT/RECOVER, resetting the previous lexer state after I do a getNextToken() that isn't followed by a scan or consume?

revusky

adMartem

I just noticed this post. I'm a bit confused because it sounds like the ESQL_SELECT is getting re-tokenized in another lexical state and therefore... but if it was already tokenized as ESQL_SELECT, shouldn't the token be cached?

Well, I guess the question is, what does your consumeSqlFragment() method look like?

adMartem

I'm still trying to work around this one. When I do, I should have a better notion of what, if anything, is wrong with state changing for my use case. Yesterday I concluded that I should be discarding the "peeked" state-changing token and restoring the previous state, but so far I haven't gotten that to work. More later.

adMartem

This is tricky to work around. What almost works is to (in my consumeSqlFragment) keep track of the lexical state after every token scanned, and then, when leaving (having encountered the END_EXEC that triggers a state change), restore the state to what it was after the last truly-scanned token. When consuming (not scanning), the restore at the end is skipped, leaving the state as is. It doesn't work because the lexer only effects the state change when it first recognizes the token, so it ends up leaving the state in SQL_STATE, not DEFAULT as the END_EXEC specifies, and then the parse fails on subsequent expansions since they are expecting DEFAULT state behavior. I can probably make it work by saving the changed state the first time it exits (i.e., the first time it scans using the method) and then using that to restore the lost state change after consuming the previous token. It will be really ugly though, and very fragile, since that logic would not work if the method were being used from several different expansions within the same overall scan (in my grammar it isn't). And it probably has many other flaws that will occur to me when I think about it. The reason I can't simply "doLexicalStateSwitch" when I am consuming is that in my parser, all of the state changes back to DEFAULT are done indirectly by a lexical action ("toDefault()") in order to dynamically choose between DEFAULT and DECIMAL_IS_COMMA as the default state. This raises the question of how to "reverse" the effect of a lexical action, in general. Luckily, I know how to do it in this instance.
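A rough sketch of why the save-and-restore attempt loses the state change, again as a toy model in Java rather than the real parser. The caching lexer, the consumeFragment helper, and the rule that a state switch fires only when a token is first recognized are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class RestoreStateDemo {
    enum LexState { DEFAULT, SQL_STATE }

    /** Toy lexer that caches tokens; state switches fire only on first recognition. */
    static final class CachingLexer {
        private final String[] words;
        private final List<String> cache = new ArrayList<>();
        LexState state = LexState.SQL_STATE;

        CachingLexer(String input) {
            this.words = input.trim().split("\\s+");
        }

        /** Returns the type of token i, recognizing (and caching) tokens as needed. */
        String token(int i) {
            while (cache.size() <= i) {
                String word = words[cache.size()];
                String type;
                if (state == LexState.SQL_STATE && word.equals("END-EXEC")) {
                    type = "END_EXEC";
                    state = LexState.DEFAULT; // happens once, when first recognized
                } else if (state == LexState.SQL_STATE) {
                    type = word.equals("SELECT") ? "ESQL_SELECT" : "SQL_WORD";
                } else {
                    type = word.equals("SELECT") ? "SELECT" : "WORD";
                }
                cache.add(type);
            }
            return cache.get(i);
        }
    }

    /**
     * Walks tokens from start up to, but not including, END_EXEC and returns its index.
     * When scanning, the state is restored to what it was after the last token walked.
     * Assumes the input contains END-EXEC.
     */
    static int consumeFragment(CachingLexer lexer, int start, boolean scanning) {
        LexState afterLastWalked = lexer.state;
        int i = start;
        while (!lexer.token(i).equals("END_EXEC")) {
            afterLastWalked = lexer.state;
            i++;
        }
        if (scanning) {
            lexer.state = afterLastWalked; // undo the switch caused by peeking at END_EXEC
        }
        return i;
    }

    public static void main(String[] args) {
        CachingLexer lexer = new CachingLexer("SELECT a FROM t END-EXEC MOVE");
        consumeFragment(lexer, 0, true);    // lookahead pass
        consumeFragment(lexer, 0, false);   // regular parsing pass
        System.out.println(lexer.state);    // SQL_STATE, although END_EXEC asked for DEFAULT
        System.out.println(lexer.token(5)); // SQL_WORD, where DEFAULT would have given WORD
    }
}
```

The restore in the scanning pass throws away the only state switch the lexer will ever perform for that END_EXEC, so the consuming pass finds it in the cache and the switch to DEFAULT never happens again.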

It seems like there should be an elegant solution that would accommodate what I am doing without breaking existing uses of lexical state and, especially, lexical actions, but at this time I can't think of it. I think it is complicated by questions regarding the interaction of activating and deactivating tokens (and hence re-accepting tokens) and potentially performing the state change and lexical action multiple times.

So for now, I am going to go the ugly route. If I have any ideas along the way, I'll let you know.

revusky

adMartem

Most likely the real solution to your problem is to move the lexical state switch logic from the lexer to the parser. Did you ever come across this essay, I wonder? https://javacc.com/2021/01/24/context-sensitive-lexical-states/

That is from well over a year ago, but the funny (not ha-ha funny) thing is that it wasn't really working properly when I wrote that. At least not fully. There were various glitches that I nailed when I finally needed this stuff doing the CSharp grammar about a year later! As you might be aware, CSharp is one nasty grammar to write, what with interpolated strings and stuff like that. Actually, I don't think that legacy JavaCC is powerful enough to write a decent CSharp grammar.

It could be useful to study how the CSharp grammar I wrote deals with some of this stuff. In particular, just look at https://github.com/javacc21/javacc21/blob/master/examples/csharp/CSharp.javacc#L2255 until the end of the file, and the various uses of the LEXICAL_STATE directive there. By the way, the last two productions in the grammar are written differently, mostly just to get a bit more test coverage.

But I guess the essence might be that if you have a production that is considered to take place in a given lexical state, like the ubiquitous FOO, let's say, you could have:

  SomeProduction : FOO : blah blah ;

So we say that SomeProduction is in the FOO lexical state, then

    SCAN SomeProduction => DoSomething

should handle most of the housekeeping because it switches into FOO and out of FOO transparently, and thus handles the messy details or banana peels you're tripping up on. (It should, but this is one of the less tested parts of the code base, so you could do the project some good by really putting it to a stringent use test.) But anyway, if you study the final 40-odd lines of the CSharp grammar, you might find some ideas for handling this elegantly. Or you could wait until, one of these years, this lexical state machinery stuff is properly documented in a proper manual.... (Well, we are working on that sort of thing, but...)
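The general idea of switching into a state on entry to a production and back out on exit can be sketched in plain Java as below. This is only an illustration of the pattern, not the code the parser generator emits; ScopedLexicalState, inState, and currentState are hypothetical names.

```java
import java.util.function.Supplier;

public class ScopedLexicalState {
    enum LexState { DEFAULT, FOO }

    private LexState current = LexState.DEFAULT;

    LexState currentState() {
        return current;
    }

    /** Runs body with the lexer in the given state, then restores the previous state. */
    <T> T inState(LexState state, Supplier<T> body) {
        LexState previous = current;
        current = state;
        try {
            return body.get();
        } finally {
            current = previous; // restored whether the body returns normally or throws
        }
    }

    public static void main(String[] args) {
        ScopedLexicalState lexer = new ScopedLexicalState();
        LexState seenInside = lexer.inState(LexState.FOO, lexer::currentState);
        System.out.println(seenInside);           // FOO
        System.out.println(lexer.currentState()); // DEFAULT
    }
}
```

Because the switch is tied to the production's entry and exit rather than to recognizing a particular token, a lookahead over the production leaves the state exactly as it found it.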

adMartem

revusky

I'm embarrassed. I remembered the essay, but at the time I read it (early in my CongoCC era), I decided it didn't apply to me due to my use of lexical actions combined with explicit state changes to manipulate the lexical state. What I didn't realize was that all of my complexity in that area was the result of my earlier (10 years ago) wandering in the JavaCC mines of Moria in the area of lexical state you highlight. I even remember being puzzled, and then scared off by the "usually a bad idea" verbiage. Using the new CongoCC features I can eliminate all of the, dare I say, crufty code in that area. My SQL consumer method is still relevant, but it works perfectly in its original form in conjunction with state changes at the non-terminal declaration and expansion level. I really do agree with you in regard to the location of lexer state switch logic. At least in my case, 99% of the state changing is properly triggered by the parser/grammar, and by doing so I believe the recovery logic can be made much better and more reliable.

I never liked Kansas much, anyway.