Migrating to CongoCC (any snags should be quite minor)

revusky

I thought to write a little update on what I've been doing over the last week -- a bit over a week, I guess, more like ten days. There has been a fair bit of housekeeping and so forth that has taken place but really CongoCC is quite rock solid, no less so than the last builds of JavaCC 21. It already self-bootstraps (i.e. is used to build itself) and runs all the same integrated functional tests -- principally, parsers for Java, Python, and CSharp. So I would strongly encourage people to move to using Congo now. I am really quite sure that there is no reason not to. (You can pick up a prebuilt jarfile. Just try it and I think you'll see that the disruption is quite minimal.)

Note, however, that CongoCC is not fully backward compatible with JavaCC 21. So let me outline the various issues in migrating:

Doubtless the biggest single issue in migrating is that CongoCC removes support for all the legacy JavaCC syntax. Most of the various issues can be dealt with automatically using the syntax converter that is in recent versions of JavaCC 21.

 java -jar javacc-full.jar convert grammarfile

There is some chance that this is all you need to do. However, it is more likely, at least for non-trivial projects, that your project will not build without a few extra little tweaks. Most (probably all) of this just amounts to adjusting the various import statements. Here is a rundown of the relevant changes to be aware of:

The tool now always generates two packages if you have tree building enabled: the parser package and the node package.
You cannot generate code into the default/unnamed package. If you do not specify a parser package or a node package they will be generated, as described below.
There is no XXXConstants interface any more. If any of your own code explicitly refers to that, you will naturally need to do some adjustment.
The BaseNode class is now generated in the node package, not the parser package. I think that was a design mistake before and I am taking the opportunity to fix it. In any case, what with all the various import adjustments, this is just one more.

Another point that should be mentioned is that you may need to put the option LEGACY_GLITCHY_LOOKAHEAD=true at the top of your grammar file, if you never made any adjustment for that. This is explained here. Aside from the legacy glitchy lookahead issue (which is now off by default) it really seems to me that if you manage to get your project to build with CongoCC, it should work just as well as before. (But if that is not the case, by all means do report back!)

Oh, here are another couple of things that are unlikely to affect you, but I mention them for completeness' sake:

The generated XXXLexer class is assumed to descend from the abstract base class TokenSource so if you had something like: INJECT LEXER_CLASS : extends MyBaseLexerClass this should now be: INJECT TokenSource : extends MyBaseLexerClass and that should resolve the issue. (I think so...)
You cannot generate or inject any code outside of the aforementioned two packages. There was a way to do that in JavaCC 21 that probably nobody was using because it was not really properly documented anywhere and was not being used in any internal code or example.

The above is really all you need to get going. If you're not so much into reading, you can stop here. I think the above hits all the main points. But you could also read on for a bit more detail.

The `XXXConstants` interface is gone.

I guess this could be under the rubric of "general housekeeping". Using an XXXConstants interface as a way of defining constant variables is generally recognized as bad practice. Joshua Bloch, in Effective Java, dubs this the Constant Interface Antipattern. (I just learned the term the other day from some Googling. I never read the book in question, even though it has apparently been quite influential, likely because it was written by a Java team insider. Well, besides, Vinay Sajip told me at least a couple of times that he hates this coding pattern too, so I am finally taking the opportunity to get rid of it.) Generally speaking, implementing an interface solely for notational convenience, i.e. the ability to write RED instead of Color.RED, gives off a certain code smell. It does seem like something of an abuse of the type system. But this whole XXXConstants thing was there from legacy JavaCC and actually, in JavaCC 21, it only had two things in it, two generated Enum types -- LexicalState and TokenType. Now the LexicalState Enum is defined inside your XXXLexer and the TokenType Enum is defined inside the Token class. This basically means that any client code that had:

 import foo.parser.FooConstants.TokenType;
 import foo.parser.FooConstants.LexicalState;

this would need to be:

 import foo.parser.Token.TokenType;
 import foo.parser.FooLexer.LexicalState;

And any static imports would also have to be adjusted, so for example:

 import static foo.parser.FooConstants.TokenType.*;

becomes:

 import static foo.parser.Token.TokenType.*;

Actually, it just occurred to me that the Syntax checker in the latest versions of JavaCC 21 could automatically convert the above imports. BUT... only in code injections in the grammar, they would still have to be adjusted manually in any Java code written separately...
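For anyone who wants to see the before and after in miniature, here is a rough sketch. None of it is real generated code: FooConstants, Token and FooLexer below are hand-written stand-ins, the enum values (EOF, IDENTIFIER, NUMBER and so on) are invented, and everything is nested in one class only so that the file compiles on its own.

```java
// Hand-written stand-ins, NOT actual CongoCC or JavaCC 21 output.
// All enum values here are invented for illustration.
public class ConstantsSketch {

    // JavaCC 21 style: an interface whose only job is to hold the two enums.
    public interface FooConstants {
        enum TokenType { EOF, IDENTIFIER, NUMBER }
        enum LexicalState { DEFAULT, IN_COMMENT }
    }

    // CongoCC style: TokenType is nested inside Token...
    public static class Token {
        public enum TokenType { EOF, IDENTIFIER, NUMBER }

        private final TokenType type;

        public Token(TokenType type) { this.type = type; }

        public TokenType getType() { return type; }
    }

    // ...and LexicalState is nested inside the lexer.
    public static class FooLexer {
        public enum LexicalState { DEFAULT, IN_COMMENT }

        private LexicalState state = LexicalState.DEFAULT;

        public LexicalState getLexicalState() { return state; }

        public void switchTo(LexicalState newState) { state = newState; }
    }

    // Old client: implements the interface purely so that TokenType
    // can be written without qualification (the antipattern).
    public static class OldClient implements FooConstants {
        public static boolean isName(TokenType t) {
            return t == TokenType.IDENTIFIER;
        }
    }

    // New client: no marker interface, the enum is reached through Token.
    public static class NewClient {
        public static boolean isName(Token t) {
            return t.getType() == Token.TokenType.IDENTIFIER;
        }
    }

    public static void main(String[] args) {
        Token tok = new Token(Token.TokenType.IDENTIFIER);
        System.out.println(NewClient.isName(tok));

        FooLexer lexer = new FooLexer();
        lexer.switchTo(FooLexer.LexicalState.IN_COMMENT);
        System.out.println(lexer.getLexicalState());
    }
}
```

In real client code the two enums would of course come in through the import lines shown earlier rather than being nested in a demo class like this.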

No Support for the "default" or "unnamed" package

This is actually related to the previous point, because static import is really the better way of doing these things, and thus avoiding the CIA (Constant Interface Antipattern).

(It does seem like a good idea generally to steer clear of the CIA!)

But there is also a hypertechnical point here: you cannot do a static import from the default/unnamed package, so this leads to a general problem, the best solution to which I finally decided was just disallowing the default package! Besides, there is no real reason to ever put any Java code in the unnamed/default package. Really, it is just a convenience for writing little toy examples that all exist in the same directory. The way it works now is that if you don't have the PARSER_PACKAGE setting set, it just uses the class name of the XXXParser (converted to lower case) as the package name.

So, if you have a grammar that generates a FooParser.java (and FooLexer.java) but does not specify PARSER_PACKAGE, it just generates all the code in package fooparser. So, the fully qualified name of your parser then is the rather Dmitry Dmitryevich-ish (or Patrick Fitzpatrick-ish) name of:

 fooparser.FooParser

But, if you find that ugly, then by all means define a separate package name using the PARSER_PACKAGE= setting. It occurs to me that this could be a variant on your "Miranda rights" (which I don't have since I don't live in the U.S.)

If you cannot afford ~~an attorney~~ a package name, one will be appointed to you.

So there you go...

Separate `NODE_PACKAGE`

In CongoCC, the generated Node types are generated in a separate package even if none is specified. (In JavaCC 21, it generated the nodes in the same package as the parser if no NODE_PACKAGE was specified -- and that would be the unnamed package if no PARSER_PACKAGE was specified either.)

So, now, in CongoCC, if you let things happen by convention (and really what's wrong with that?) then if you have a grammar called Foo.ccc (I figure we'll use the .ccc extension by default/convention, but there is nothing special about it, and you can use something else if you want) and you don't specify anything else, it will generate:

FooParser.java and FooLexer.java in the package fooparser
The various node classes in the package fooparser.ast

So, you can see that, if you don't specify anything, then the thing generates something that more or less makes sense -- as opposed to dumping everything into the default/unnamed package!
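The fallback rules can be written down in a few lines. This is only a sketch of the convention as described in this post, not the tool's actual code: the class and method names are made up, and the ".ast" suffix for the node package is inferred from the fooparser.ast and foo.parser.ast examples here rather than taken from the implementation.

```java
import java.util.Locale;

// Sketch of the package naming fallbacks; hypothetical helper, not CongoCC source.
public class PackageDefaults {

    // If PARSER_PACKAGE is not set, fall back to the parser class name in lower case.
    public static String parserPackage(String parserClassName, String explicitParserPackage) {
        if (explicitParserPackage != null && !explicitParserPackage.isEmpty()) {
            return explicitParserPackage;
        }
        return parserClassName.toLowerCase(Locale.ROOT);
    }

    // If NODE_PACKAGE is not set, assume "<parser package>.ast"
    // (inferred from the examples in this post).
    public static String nodePackage(String parserPackage, String explicitNodePackage) {
        if (explicitNodePackage != null && !explicitNodePackage.isEmpty()) {
            return explicitNodePackage;
        }
        return parserPackage + ".ast";
    }

    public static void main(String[] args) {
        // Nothing specified: Foo.ccc generates FooParser.
        String parserPkg = parserPackage("FooParser", null);
        System.out.println(parserPkg + ".FooParser");
        System.out.println(nodePackage(parserPkg, null) + ".BaseNode");

        // Only PARSER_PACKAGE specified.
        String explicit = parserPackage("FooParser", "foo.parser");
        System.out.println(nodePackage(explicit, null) + ".BaseNode");
    }
}
```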

Oh, another little detail is that now BaseNode.java is generated in the node package, not the parser package. (And again, note that if you use Congo, there is always a separate node package.) So this can mean that in your own client code, you would likely have had in various spots:

 import foo.parser.BaseNode;

and that would need to be changed to:

 import foo.parser.ast.BaseNode;
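If you have a lot of hand-written Java files to adjust, the three import changes covered so far (TokenType, LexicalState, BaseNode) are mechanical enough to script. Below is a minimal sketch, assuming plain text replacement is good enough for your code base. ImportFixer is a made-up helper, not part of CongoCC or of the syntax converter, and it deliberately does not handle things like a wildcard import of FooConstants or a class that says "implements FooConstants"; those still need a manual look.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for hand-written client code; NOT part of CongoCC.
public class ImportFixer {

    // Rewrites the imports that changed between JavaCC 21 and CongoCC.
    // grammarName is the "Foo" in FooParser / FooLexer / FooConstants.
    public static String fixImports(String source, String parserPackage,
                                    String nodePackage, String grammarName) {
        String constants = parserPackage + "." + grammarName + "Constants.";
        Map<String, String> rewrites = new LinkedHashMap<>();
        // TokenType now lives in Token (covers normal and static imports).
        rewrites.put(constants + "TokenType",
                     parserPackage + ".Token.TokenType");
        // LexicalState now lives in the lexer class.
        rewrites.put(constants + "LexicalState",
                     parserPackage + "." + grammarName + "Lexer.LexicalState");
        // BaseNode moved from the parser package to the node package.
        rewrites.put("import " + parserPackage + ".BaseNode;",
                     "import " + nodePackage + ".BaseNode;");

        String result = source;
        for (Map.Entry<String, String> rewrite : rewrites.entrySet()) {
            result = result.replace(rewrite.getKey(), rewrite.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String before = String.join("\n",
            "import foo.parser.FooConstants.TokenType;",
            "import foo.parser.FooConstants.LexicalState;",
            "import static foo.parser.FooConstants.TokenType.*;",
            "import foo.parser.BaseNode;");
        System.out.println(fixImports(before, "foo.parser", "foo.parser.ast", "Foo"));
    }
}
```

Running it a second time over already-converted source leaves the text unchanged, so it should be safe to apply across a whole source tree on a migration branch.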

Now, regarding this thing of the parser package and the node package, here is another little detail that, most likely, doesn't affect anybody.

In JavaCC 21, there was a way you could generate classes outside of either package. You could write:

INJECT : {
    package foo.bar;

    // And whatever type declarations 
}

Any types specified in there would be generated into the foo.bar package. The current situation is that the package declaration in this sort of INJECT (where basically everything in between { and } is a java compilation unit) is now just ignored. (I guess I'll put in a warning at some point that the package is ignored here...) So the upshot of that is that any code that is generated is now generated in either the parser package or the node package, nowhere else, and I think that probably simplifies things conceptually. And again, in Congo, those two packages are always generated -- unless you have set TREE_BUILDING_ENABLED to false, in which case, there is only the parser package.

I had intended to explain at more length the reasons behind all this refactoring, but I'll explain it separately, I think. It would be nice to get a bit of feedback from people regarding just how hard (or easy) it was to migrate to CongoCC. I really don't think it should pose much of a problem (though I could be wrong.) I think the syntax converter does most of the nitpicking work. These changes to import statements and such really shouldn't be much of a big deal. Again, my advice would be to do the migration in a branch and then, when you're confident, merge it to master/main. (And then there should be no need to look back!)

adMartem

Other than the one apparent bug I reported, everything I encountered was covered by this discussion. I.e., easy peasy.