Over the last few days, I finally implemented this new feature. It's not the same as the earlier one, though it is related, of course. What that blog article from about four years ago outlines is the ability to turn on/off tokens in whatever lexical state -- or, to get particularly nerdy about it, tokens that are part of the NFA (finite state machine) corresponding to any given lexical state(s).
The feature I am announcing now is not the same thing as that. You can now define contextual tokens that are not part of the tokenization machinery. The main anticipated use is for contextual keywords that are usually just identifiers in the grammar but, in a key spot or two, are interpreted differently -- typically as keywords of some sort. Thus, for example, the contextual keyword `yield` (a stable language feature as part of switch expressions since JDK 14) is a keyword at the start of a yield statement, but everywhere else it is just a regular identifier. Presumably, this was so that existing code could continue to run. `record` is a similar case, as are `sealed` and `permits`. As a simple illustration, consider the code:
```java
int yield = 1;
boolean sealed = true;
Record record = null;
Case case = someCase;
```
The first three lines are fine, but the last one will not compile, because `case` is a reserved word in the Java language and can never be used as a regular identifier, while the other three words -- `yield`, `sealed`, `record` -- are keywords only in a very specific context but otherwise, as here, are just regular identifiers.
CongoCC now has a much more natural solution to this problem, with minimal scaffolding. Here is the current implementation of the `YieldStatement` production:
```
YieldStatement :
    'yield'
    =>|+2
    Expression
    <SEMICOLON>
;
```
There is no need to define any separate `yield` token in the lexical part of the grammar. The way this works is that the string "yield" is matched as an `IDENTIFIER`, but the contextual token `'yield'` is specified here with single quotes. So, when checking whether the token matches, the machinery sees that `yield` is a contextual keyword, and it checks whether the string (matched by the tokenization machinery as an `IDENTIFIER`) matches the contextual keyword `yield`. It does, so it quietly recognizes it as a match for `'yield'`, changing the `type` of the `lastConsumedToken` from `IDENTIFIER` to `yield`. This is an ideal case for using this feature, since `yield` occurs in this one spot in the grammar, and everywhere else the string "yield" can only be an `IDENTIFIER`.
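To make the retyping step concrete, here is a tiny Python simulation of it. This is purely an illustrative sketch, not CongoCC's actual generated code; the `Token` class and `consume_contextual` helper are invented for this example:

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str
    image: str

def consume_contextual(token, keyword):
    # The tokenizer only ever produced an IDENTIFIER; at the one spot
    # where the grammar says 'yield', the parser checks the image and
    # quietly retypes the token.
    if token.type == "IDENTIFIER" and token.image == keyword:
        token.type = keyword
        return True
    return False

tok = Token("IDENTIFIER", "yield")
consume_contextual(tok, "yield")   # tok.type is now "yield"
```

Everywhere else in the grammar, nothing ever asks for the contextual keyword, so the token keeps its `IDENTIFIER` type.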
Perhaps the easiest way to understand this is by contrast with the (shortcut) defining of a token type using a literal string. If you write `"yield"` instead of `'yield'`, with no corresponding mention in the lexical grammar, the string "yield" is added as a regular expression to be matched. (Note that it is added only in the "DEFAULT" lexical state, which could be a gotcha, but this feature mostly exists to support relatively small grammars that typically have only one lexical state, so you can work up a grammar with minimal scaffolding.)
So, note that:

```
"yield" <IDENTIFIER>
```

would NOT match the input `yield yield`, because we now have a new token type, called `_yield`, that is NOT an identifier, and both occurrences of "yield" in the input are matched as that. But if we have:

```
'yield' <IDENTIFIER>
```

that WILL match the input `yield yield`, because the first `yield` is matched as the `yield` type and the second occurrence of "yield" is an `IDENTIFIER`. Or, to put it another way, the string is matched as the `yield` type only when we specifically mention it. This is a subtle but crucial difference. Another example: in the choice:
```
<IDENTIFIER>
|
'yield'
```
the second alternative is necessarily unreachable: if the coming input were "yield", it would be matched by the first alternative, `IDENTIFIER`, so the second is never reached. So the only feasible way of writing this (at least to behave as you presumably want) is:
```
'yield'
|
<IDENTIFIER>
```
because that is the only way you can ever identify this as the *soft keyword* `yield`. In the opposite order, it will always match `IDENTIFIER`. So, note the different semantics as compared to the double-quoted `"yield"`.
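The ordering rule can be simulated in a few lines of Python. Again, this is just an illustrative sketch of alternatives being tried left to right, not the real CongoCC machinery:

```python
def first_match(token_type, token_image, choices):
    # Try each alternative in order. An unquoted token type matches by
    # type; a single-quoted contextual keyword matches only when an
    # IDENTIFIER carries exactly that image.
    for choice in choices:
        if choice == "IDENTIFIER" and token_type == "IDENTIFIER":
            return "IDENTIFIER"
        if token_type == "IDENTIFIER" and token_image == choice:
            return choice
    return None

# <IDENTIFIER> | 'yield' -- the soft keyword is unreachable:
first_match("IDENTIFIER", "yield", ["IDENTIFIER", "yield"])  # -> "IDENTIFIER"
# 'yield' | <IDENTIFIER> -- now the soft keyword wins:
first_match("IDENTIFIER", "yield", ["yield", "IDENTIFIER"])  # -> "yield"
```

With a hard keyword token, by contrast, the ordering would not matter, because the input "yield" would never have been tokenized as an `IDENTIFIER` in the first place.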
Another little gotcha with these soft keywords is that they only work if there is a more general pattern that matches the string. An interesting case is the pair of new modifiers introduced to support sealed type declarations, `sealed` and `non-sealed`. The first, `sealed`, can be (and now is) replaced with this kind of contextual keyword: we can just write `'sealed'`, and the machinery will match the string as an `IDENTIFIER`, but since `'sealed'` is a contextual keyword, in the right context it will check for the string match, realize that this is not an `IDENTIFIER` (in this precise spot!), and change its type to `sealed`. But this does not work with `non-sealed`, because `non-sealed` is not matched by the `IDENTIFIER` pattern (or any more general one). (It is utterly beyond me why it was not defined as `non_sealed`, since that would surely spare people some headaches. But, okay, it provides some additional challenges, no?) In fact, the input "non-sealed" matches as the sequence of tokens `IDENTIFIER MINUS IDENTIFIER`. So, once the machinery has the `IDENTIFIER` token "non", it is not (via any magical extra lookahead) going to realize that this is the first part of `non-sealed`. Sorry. It just won't work. So again:

**For this kind of soft keyword to work, it must be matched by some existing, more general pattern.**
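You can see the problem with a toy tokenizer in Python. This is a sketch using a simplified ASCII identifier pattern, not the real Java lexer:

```python
import re

# Simplified token patterns: identifiers and a minus sign.
PATTERNS = [
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("MINUS", re.compile(r"-")),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return tokens

tokenize("sealed")      # [("IDENTIFIER", "sealed")] -- a soft keyword can retype this
tokenize("non-sealed")  # [("IDENTIFIER", "non"), ("MINUS", "-"), ("IDENTIFIER", "sealed")]
```

Since "non-sealed" never exists as a single token, there is nothing for the contextual-keyword check to retype.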
That is an interesting little detail, but the fact remains that, about 98% of the time anybody would want to use this feature for its intended use case, there is some sort of general `IDENTIFIER` pattern that will match the string. (By the way, I intend to put in warnings for these unmatchable cases, but that is unimplemented as yet.) Another detail: if you wanted to use a foreign word, like 'привет' or '你好', as a contextual keyword, that is fine, since these are valid Java identifiers and would be matched as such. But if you are in a language where an identifier is only ASCII alphanumeric, say `(["a"-"z", "A"-"Z", "0"-"9"])+`, then that won't match the aforementioned keywords (in Russian and Chinese respectively), so, again, no dice...
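A quick Python check of that point, using rough regex stand-ins for the two identifier definitions (these are approximations for illustration, not Java's exact identifier rules):

```python
import re

# Rough stand-ins: a Unicode-friendly identifier vs. an ASCII-only one.
unicode_ident = re.compile(r"[^\W\d]\w*")   # starts with any Unicode letter
ascii_ident = re.compile(r"[A-Za-z0-9]+")

unicode_ident.fullmatch("привет")   # matches -- usable as a soft keyword
ascii_ident.fullmatch("привет")     # None -- no more general pattern, no dice
```

The contextual keyword is only ever reached through the general identifier pattern, so if that pattern cannot match the string, the feature simply has nothing to work with.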
But I daresay that in most existing real-world use cases, the contextual keyword does match the identifier definition of the language. In fact, that is precisely why you want to define it as contextual: so that outside of that specific context, it is just matched as an `IDENTIFIER` or whatever more general category!
In the current implementation, the assumption is that the contextual keyword is a valid Java identifier, which is usually the case and is simpler to implement, since the `TokenType` definition just uses the string itself, which must be a valid Java identifier. That said, it is a tad inflexible, and I have already run into a case where it doesn't work! Python 3.10 introduced two soft keywords, `match` and `case`, and with the current (somewhat crude) implementation I could use this disposition for the first one but not the second, because `case` is a reserved word in Java, and the `TokenType` enum can't use it as an element. It would have to get munged into `_case` or `CASE` or something like that. Details, details... So I will have to do something about that. Meanwhile, for about 98% of the cases where you want something like this, the feature, as implemented, works pretty well. I have it going in internal use, and it has already led to quite a significant simplification and improvement in readability where it is used.
Again, the case where you most want to use this is where you have a number of contextual keywords that are used in a single spot (or two or three) and elsewhere are just regular identifiers. Some grammar specifications could have literally hundreds of these things, with your `activeTokenTypes` set holding an equal number of slots that you turn on/off as needed... well, it has long seemed to me that this is not a good solution, and this newer scheme should really be better. So I finally buckled down and implemented it.
Enjoy.