Making the Lexical Grammar more Flexible (and some related ideas)

revusky

From even the FreeCC days (late 2008) one could redefine a syntax production in a grammar, typically using the INCLUDE mechanism.

However, for some reason, it was never permissible to similarly redefine a token definition. Suppose you wanted to use the standard Java lexical grammar, which has:

  <NULL : "null">

but for some reason you wanted to let people write "None" (like in Python) as well. So you would think that if you wrote:

 INCLUDE "JavaLexer.javacc"

and below that:

  <JAVA> TOKEN  :   <NULL : "null" | "None"> ;

this would override the definition of NULL from the included file and get the desired result. But it never did work that way. It would always just complain about a "multiply defined token" and abort, since this was considered to be a fatal error. To tell the truth, I wasn't even sure whether this worked or not; I had to check. But no, it didn't work.

Well, I suppose anybody reading this is guessing (correctly) that it now does work! It was actually trivial to get this working. The reason I turned my attention to this is that I have a line of communication open with the developers of the Beanshell project. That is a quite old project that is basically a scripting/prototyping language based on Java syntax. So, in principle, it has some of the same goals as Jython or JRuby except that the Beanshell syntax is meant to be much more like Java syntax. The project uses a JavaCC (JJTree actually) grammar that is based on the Java grammar that was part of the JavaCC package way back when. Actually, I think their grammar is currently somewhere between JDK 4 and 5. (Though note that it actually does have some extra things, notably more flexible declaration of arrays, incorporating "slices" that come from Python. And some other things...)

The Beanshell project has an interesting history. It looks like it was basically dormant for nearly 15 years from 2004 to 2019 and there has been an effort to resuscitate it. But, finally, some days ago, I made them an offer they couldn't refuse (or one would think they couldn't!) I told them that I would convert the project to using JavaCC21/Congo if they would commit to using it. (Obviously, if I did this for them and then they didn't use it, that would be....) But I would do the work for them. And it seems that they are taking me up on it.

So, anyway, in terms of the topic here, redefining elements in the lexical grammar, you see here in their grammar they have some alternative ways of writing certain operators. For example, instead of < you can alternatively write @lt. I assume the main motivation for this had to do with embedding Beanshell code inside of HTML or XMLish files and the confusion between certain operators and the pointy brackets... I assume that this is something that has been in the Beanshell language for a long time, because @lt could be an annotation in principle. (That is, if you had an annotation named lt, which would admittedly not be so common.) But I am guessing they decided on these alternatives before JDK 5, when annotations were introduced. But anyway, it would make sense if you could write some alternative form instead of < or >, so in principle, if you could redefine GT as:

  <GT : ">" | "@gt">

then the syntax in the included Java grammar could continue to work except that it would also accept the alternative form for that token. And this is when I realized that you can't do that. (Or you couldn't. Now you can!)
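
To make the new behavior concrete, here is a minimal sketch in Java of "a later definition replaces the earlier one, with a warning". This is not the actual JavaCC 21 code. The class TokenTable and its methods are made up purely for illustration, and matching is reduced to exact string comparison.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the real JavaCC 21 internals.
class TokenTable {
    // Token label -> the literal alternatives that match it.
    private final Map<String, List<String>> definitions = new LinkedHashMap<>();
    // Warnings collected while definitions are added.
    private final List<String> warnings = new ArrayList<>();

    // Defines a token. A later definition with the same label replaces
    // the earlier one and records a warning instead of aborting.
    void define(String label, String... alternatives) {
        if (definitions.containsKey(label)) {
            warnings.add("Warning: redefining token " + label);
        }
        definitions.put(label, List.of(alternatives));
    }

    // Returns the label whose alternatives contain the text, or null if none does.
    String match(String text) {
        for (Map.Entry<String, List<String>> entry : definitions.entrySet()) {
            if (entry.getValue().contains(text)) {
                return entry.getKey();
            }
        }
        return null;
    }

    List<String> warnings() {
        return warnings;
    }
}

class TokenTableDemo {
    public static void main(String[] args) {
        TokenTable table = new TokenTable();
        table.define("GT", ">");          // as in the included Java grammar
        table.define("GT", ">", "@gt");   // the redefinition in the including grammar
        System.out.println(table.match("@gt"));
        System.out.println(table.warnings());
    }
}
```

The point of the sketch is only that any production referring to GT keeps working unchanged, since it refers to the label and not to the text that the label matches.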

So this got me thinking about a couple of things. First of all, JavaCC (and this originates in the legacy JavaCC) allows you to just use a string literal instead of the token label in your grammar. So, for example, you can write:

  WhileStatement : "while" "(" Expression ")" Statement ;

instead of:

  WhileStatement : <WHILE> <LPAREN> Expression <RPAREN> Statement ;

I tend to write the former rather than the latter because I find it a bit clearer to read. Of course, there are the typical arguments in favor of externalizing strings. For example, in theory, it is preferable to write:

  public static final String ACCOUNT="account";

rather than use the literal string "account" everywhere, because, for one thing, if you were later going to build a localized version and wanted that string to be "compte" or "cuenta" or "Konto", you would only have to change it in one place. And the other reason, I suppose, is that if you were going to misspell it and write ACOUNT instead, the compiler would tell you that there is no such variable as ACOUNT, but if you misspell it in a quoted string, the compiler would not complain at all. It is also true that in the JavaCC context, if you wrote "whil" instead of "while", it would not complain. It would just define a new token type corresponding to that literal string. Of course, the resulting parser would parse hardly any Java code successfully, but maybe it would take a bit longer to figure out the problem. Whereas if you wrote <WHIL> instead of <WHILE> it would immediately complain that there is no token type called WHIL defined. There is that, but still, overall, the pros and cons of using the literal string as opposed to the token label look pretty marginal to me -- this is the case particularly if it is never possible to redefine the token definition anyway.
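
The difference between the two kinds of reference can be sketched in a few lines of Java. Again, this is only an illustration of the behavior described above, not how JavaCC actually does it. The class TokenResolver, its methods, and the generated name _TOKEN_n are all invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the two ways a production can refer to a token.
class TokenResolver {
    // Token label -> the literal string it matches.
    private final Map<String, String> literalByLabel = new LinkedHashMap<>();

    void define(String label, String literal) {
        literalByLabel.put(label, literal);
    }

    // A <LABEL> reference must name an existing token type,
    // so a misspelled label is reported immediately.
    String resolveLabel(String label) {
        if (!literalByLabel.containsKey(label)) {
            throw new IllegalArgumentException("No token type called " + label);
        }
        return label;
    }

    // A "literal" reference reuses the token type that matches the string,
    // or silently creates a new one, so a misspelling goes unnoticed.
    String resolveLiteral(String literal) {
        for (Map.Entry<String, String> entry : literalByLabel.entrySet()) {
            if (entry.getValue().equals(literal)) {
                return entry.getKey();
            }
        }
        String generated = "_TOKEN_" + literalByLabel.size(); // made-up naming scheme
        literalByLabel.put(generated, literal);
        return generated;
    }
}

class TokenResolverDemo {
    public static void main(String[] args) {
        TokenResolver resolver = new TokenResolver();
        resolver.define("WHILE", "while");
        System.out.println(resolver.resolveLiteral("while")); // reuses WHILE
        System.out.println(resolver.resolveLiteral("whil"));  // quietly becomes a new token type
        try {
            resolver.resolveLabel("WHIL");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```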

But... once you can redefine a token type, it seems like it could be better style generally to externalize all these literal strings, because, well, maybe somebody wants to INCLUDE your grammar but then tweak certain things, like:

  <WHILE : "while" | "whilst">

I suppose that is rather far-fetched, but even so, the fact that somebody could do that and, in the other case, they couldn't... One could imagine things like:

  <COLOR : "color" | "colour">

because you think it is nicer to recognize the British (or British/Canadian/Australian etc) spelling as well. With the literal string externalized, you can do things like this. Well, I was just thinking about cases where you INCLUDE some grammar and want to redefine certain tokens a bit.

So, once I decided it was better to externalize all the strings, I went through and did it for the 3 main reference grammars -- Java, Python, and CSharp.

But this also got me thinking about grammar inheritance more generally, or really just inheriting elements from an included grammar. One thing is that in a real OOP language that has inheritance for real (not just pretending like we're doing so far...), when you override a method, you typically still have access to the "super" method. So you can do:

  void myMethod() {
      // in the base class
  }

and then in the overriding method:

  @Override
  void myMethod() {
      // do some stuff
      super.myMethod();
      // maybe do some other stuff
  }

Or, IOW, you can still invoke the overridden method, while if you redefine a syntactical production in JavaCC 21, you have completely clobbered the production you just redefined.

It occurred to me that it would not be too hard to allow somebody to redefine a production and also access the one that was overridden. The only thing would be the notation.

  SwitchStatement :
      SwitchStatement$Overridden
      |
      some alternative stuff
  ;

In fact, the way I'm thinking of doing it is just generating the method corresponding to the overridden production, but using some munged name, and if it is used, fine, but if it is unused, it just gets "reaped" by the DeadCodeEliminator logic. The method generated has to be private for this to work, but that basically makes sense.
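
Roughly, the generated code might have the following shape. This is a sketch of the idea only, since none of this is implemented yet. The class SketchParser, the boolean return values, and the "match" alternative are all invented for illustration, and the real generated code would look quite different.

```java
// Hypothetical sketch of generated parser code; names and structure are assumptions.
class SketchParser {
    private final String input;

    SketchParser(String input) {
        this.input = input;
    }

    // The overriding production: first try the definition from the
    // included grammar, then fall back to the newly added alternative.
    boolean SwitchStatement() {
        return SwitchStatement$Overridden() || input.startsWith("match");
    }

    // The production from the included grammar, kept under a munged name.
    // Since it is private, every possible caller is in this class, so a
    // dead-code pass can safely drop it when nothing here refers to it.
    private boolean SwitchStatement$Overridden() {
        return input.startsWith("switch");
    }
}

class SketchParserDemo {
    public static void main(String[] args) {
        System.out.println(new SketchParser("switch (x) {}").SwitchStatement());
        System.out.println(new SketchParser("match x {}").SwitchStatement());
        System.out.println(new SketchParser("while (true) {}").SwitchStatement());
    }
}
```

Note that the $ character is legal in a Java identifier, so a munged name like SwitchStatement$Overridden could be used as a method name directly.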

This strikes me as useful because a production that is overridden is usually replaced by something pretty similar, though obviously a bit different, so being able to re-use the overridden production this way could avoid a lot of the copy-paste antipattern. Actually, it might be common that the overriding production is basically the same and you just want to put in an extra assertion or two for sanity checking purposes.

Another thing that occurs to me is that it may well be that a significant case in terms of redefining grammar productions is solely to redefine the tree-building annotation. So it should probably be possible to write:

   MyProduction #(someCondition()) ;

So, build a node if some condition is satisfied. But other than that, just reuse the definition of MyProduction that was included.

Well, enough said for now, I suppose. Of the stuff I'm talking about above, the only one that is implemented is the ability to redefine a token definition. BTW, it does warn you or tell you that you are doing that, to make it a little harder to do inadvertently.