Greetings,
The legacy JavaCC (or really the tree-building add-on JJTree) suffers from a longstanding problem. It does not at all contemplate the problems of multi-language (a.k.a. polyglot) projects. Basically, if you have two (or more) JavaCC grammars in your project, each one will generate its own `Token` class, its own `Node` interface, and so on, and it seems that nobody really envisaged any need for these generated classes to interoperate at all. This problem was also present in its successor, JavaCC 21, but I am happy to report that it is resolved in CongoCC.
Now admittedly, the existing state of affairs was fine and dandy as long as these things were parts of completely separate subsystems that had nothing to do with one another. In fact, arguably, in that case it is even a good thing! Certainly, the fact that the tool generates code that is totally self-contained, with no dependencies other than the core Java library, is often thought of as a strong point.
However, there are surely many cases where the natural way to model the problem is to generate a single AST, even though different parts were generated by different parsers. For example, webpages are in HTML, but they may also contain snippets of other languages inside, like CSS or JavaScript. One can even think of Javadoc comments in a Java source file as an embedded mini-language. Actually, come to think of it, in JavaCC 21 the situation is even more gnarly, because the core `Token` construct was retrofitted to extend `Node` precisely so that the tokens could be terminal nodes in the tree. However, there was no provision for `Token` types generated by different grammars to co-exist in the same tree. This is not only for the aforementioned reason that the different `Token` types have different root `Node` APIs, but also because each `Token` has its own `TokenType` enum -- and, well, these things don't interoperate either!
This whole problem is now basically solved in CongoCC. (Though there may be some rough edges to be rounded out over the coming short while...) Right now I'll outline the general solution:
The `ROOT_API_PACKAGE` setting
The new `ROOT_API_PACKAGE` setting is really the linchpin. When we set this, it means that we are not going to generate a `Node` interface from this grammar, but rather, we are going to reuse the base `Node` API that was generated by another grammar. You can see that being used here and here also. So, for example, we have:
PARSER_PACKAGE=org.congocc.parser.python;
ROOT_API_PACKAGE=org.congocc.parser;
This means that (as before) we generate the `PythonParser` in the package set by `PARSER_PACKAGE`, which is specified here as `org.congocc.parser.python`, BUT the other (new) setting, `ROOT_API_PACKAGE`, says that we are going to re-use the base API generated for the overall CongoCC parser. Similarly, the `CSharpInternal.ccc` file contains the analogous lines:
PARSER_PACKAGE=org.congocc.parser.csharp;
ROOT_API_PACKAGE=org.congocc.parser;
What this means in both these cases is that all the generated `Node` and `Token` types in the system end up extending the common `org.congocc.parser.Node` API. Or, in other words, the Python and C# parsers generated for internal use inside the CongoCC tool itself produce trees that can be added to the overall tree for a CongoCC grammar file, even though they were actually generated by separate parsers from separate grammars.
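To make that concrete, here is a minimal, purely illustrative sketch of what sharing a root API buys you. None of the class names below are the actual generated code; they are stand-ins. The point is simply that once every generated node type implements the same root `Node` interface, a subtree produced by one parser can be attached directly to a tree produced by another.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the shared root API that ROOT_API_PACKAGE points at.
interface Node {
    void addChild(Node child);
    List<Node> children();
}

// Hypothetical node produced by the CongoCC grammar's own parser.
class GrammarFileNode implements Node {
    private final List<Node> children = new ArrayList<>();
    public void addChild(Node child) { children.add(child); }
    public List<Node> children() { return children; }
}

// Hypothetical node produced by the separately generated internal Python parser.
class PythonModuleNode implements Node {
    private final List<Node> children = new ArrayList<>();
    public void addChild(Node child) { children.add(child); }
    public List<Node> children() { return children; }
}

class PolyglotTreeSketch {
    public static void main(String[] args) {
        Node root = new GrammarFileNode();
        // Legal precisely because both node types share the same root Node API:
        root.addChild(new PythonModuleNode());
        System.out.println(root.children().size()); // prints 1
    }
}
```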
So, in the specific case of the CongoCC grammar itself, the approach taken is to define a separate node type called `UnparsedCodeBlock`, and you can see the implementation here. Any code that is not parsed by the CongoCC parser is taken to be in an `UnparsedCodeBlock` and, on that level, the content is just dealt with lexically -- i.e. we just scan forward looking for the special string (which is `$}` in this case) to end the unparsed content. In other words, the `UnparsedCodeBlock` starts with `{$` and ends with `$}`, on the assumption that it is very uncommon for the terminating sequence `$}` to occur in any embedded source code. (Is that assumption wrong?) So when we parse `Foo.ccc`, the unparsed content is ignored and left for a second pass. In fact, the subtree in the embedded language that the `UnparsedCodeBlock` contains can be created lazily and added to the tree in a second pass. (Note also that if the code in that block turns out to be syntactically invalid in whatever embedded language, that does not prevent the rest of the grammar file from being parsed or the construction of the overall tree. We just end up with a node that contains invalid code.)
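Just to illustrate what "dealt with lexically" amounts to, here is a tiny self-contained sketch, not the actual CongoCC lexer code, of scanning forward to the `$}` terminator:

```java
public class UnparsedBlockScan {

    /**
     * Given the full source text and the offset just past an opening "{$",
     * return the raw (unparsed) content up to, but not including, the closing "$}".
     * Purely illustrative; the real lexer does this inside its tokenization loop.
     */
    static String scanUnparsedContent(String source, int offsetAfterOpen) {
        int end = source.indexOf("$}", offsetAfterOpen);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated {$ ... $} block");
        }
        return source.substring(offsetAfterOpen, end);
    }

    public static void main(String[] args) {
        String text = "stuff before {$ any embedded code here $} stuff after";
        int offsetAfterOpen = text.indexOf("{$") + 2;
        System.out.println(scanUnparsedContent(text, offsetAfterOpen));
        // prints: " any embedded code here "
    }
}
```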
I think the above describes the essence of the situation and is probably enough to get going with this, assuming you need this feature yourself. There are some more detailed aspects of all this that you may not need to even know -- certainly not initially -- but if you are curious, by all means read on, though it is optional.
Dealing with Multiple Token Types
Of course, the devil is in the details, and certain additional refactorings were necessary to get this all working. One technical hurdle was that the different parsers still have their own separately generated `Token` and `Token.TokenType`. Once we envisage sets (or lists or streams...) of tokens that are heterogeneous, i.e. they were generated by different parsers and thus have different `TokenType` enums, we see that we really need a way to refer to these things with a common base API. So, you will note that the generated `Token` class and `TokenType` enum now implement two new interfaces, `Node.TerminalNode` and `Node.NodeType` respectively. So, one very significant aspect of this refactoring is that the base `Node` interface, to be generally reusable in a polyglot setting, needs to refer to any tokens and token types (that potentially come from other parser subprojects) exclusively via those root interfaces.
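In outline, the pattern looks something like the following sketch. The member lists are stripped down to the bare minimum and should be read as assumptions, not as the actual generated signatures:

```java
// Bare-bones stand-in for the shared root API.
interface Node {
    // Implemented by every generated TokenType enum.
    interface NodeType {
        String name();   // any enum constant already provides this
    }
    // Implemented by every generated Token class.
    interface TerminalNode extends Node {
        NodeType getType();
    }
}

// What one generated parser might contribute (names are illustrative):
enum PyTokenType implements Node.NodeType { NAME, NUMBER, NEWLINE, INDENT, DEDENT }

class PyToken implements Node.TerminalNode {
    private final PyTokenType type;
    PyToken(PyTokenType type) { this.type = type; }
    // The concrete enum constant is exposed only through the root interface.
    public Node.NodeType getType() { return type; }
}
```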
In JavaCC 21, the generated `XXXLexer` object really combined two different functionalities: the actual tokenization (the NFA loop) and also a kind of file map that kept track of starting line positions and such. It became obvious (while implementing this) that there was a need to be able to refer to multiple `XXXLexer` classes via a common API, so, in Congo, the file-map/location functionality is broken out into a separate abstract base class, `TokenSource`, from which all the `XXXLexer` classes descend. Thus, in a polyglot project, the various lexer objects can all be referred to via a common `TokenSource` API. Note also that the `TokenSource` API only refers to `Node.TerminalNode` and `Node.NodeType`, never to the concrete implementations `Token` and `Token.TokenType`. But again, all this refactoring is basically non-disruptive to existing users, in this case because the more abstract API is assignment compatible with the concrete implementations -- i.e. a `Node.NodeType` variable can be assigned any member of whatever `TokenType` enums are generated.
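As a rough sketch of that shape, continuing with the stand-in `Node.NodeType` and `Node.TerminalNode` interfaces and the `PyToken`/`PyTokenType` types from the previous sketch (the `tokenAt` method here is an invented illustration, not the real `TokenSource` API):

```java
// The common base class only ever speaks in terms of the root interfaces.
abstract class TokenSourceSketch {
    // File-map/location bookkeeping would live here; token access is expressed
    // via Node.TerminalNode, never via any concrete generated Token class.
    abstract Node.TerminalNode tokenAt(int offset);
}

class AssignmentCompatibilityDemo {
    static void demo(PyToken someToken) {
        // Assignment compatibility: a concrete enum constant is-a Node.NodeType,
        // and a concrete token is-a Node.TerminalNode.
        Node.NodeType type = PyTokenType.NAME;
        Node.TerminalNode terminal = someToken;
        System.out.println(type.name() + " / " + terminal.getType().name());
    }
}
```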
Generalization of `TokenType`
The case of `Node.NodeType` is actually technically interesting (at least I find it so) because it relies on the fact that a Java enum, though it cannot extend another class or be subclassed, can still implement an interface. (While legacy JavaCC simply used `static final int` constants to define the token types, JavaCC 21 used type-safe enums.) And that led to the use of `java.util.EnumSet` to represent sets of token types, such as a first set, which is the set of token types that can begin a production. The use of `EnumSet`, by the way, is very (I mean VERY) computationally efficient -- particularly if the enum type has 64 elements or fewer, because in that case the information about which elements are in the set is held in a single primitive `long` variable, and checking whether an `EnumSet` contains a given element boils down to checking whether a given bit is set in that variable. There is a little more overhead if the enum has more than 64 elements, because then the information is stored internally in a `long[]` array rather than a single `long`, which means a bit more storage and an extra level of indirection (accessing a member of an array), but even then it is surely pretty close to being free. In short, I do like `EnumSet`, because it is both very notationally convenient AND extremely computationally efficient. (What is there not to like?)
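For anyone who has not used `EnumSet`, here is roughly what that looks like in practice; the enum below is just an illustrative stand-in for a generated `TokenType`:

```java
import java.util.EnumSet;

class FirstSetDemo {
    // Illustrative stand-in for a generated TokenType enum.
    enum TokType { IDENTIFIER, NUMBER, LPAREN, MINUS, STRING }

    public static void main(String[] args) {
        // e.g. the first set of some hypothetical "PrimaryExpression" production
        EnumSet<TokType> firstSet =
                EnumSet.of(TokType.IDENTIFIER, TokType.NUMBER, TokType.LPAREN, TokType.MINUS);

        // With 64 or fewer enum constants, the set is backed by a single long,
        // so contains() is essentially one bit test.
        System.out.println(firstSet.contains(TokType.NUMBER)); // true
        System.out.println(firstSet.contains(TokType.STRING)); // false
    }
}
```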
The problem is that a potentially heterogeneous set of these enum types cannot be held in a single `EnumSet<TokenType>` instance. However, such sets can still be referred to from a common API. For example, if you have:
Set<? extends Node.NodeType> expectedTypes;
that is assignable from any `EnumSet<TokenType>` instance. Thus, for example, to generalize the API for error recovery and such, we could scan forward and look for a token whose type is contained in a set. And that can be expressed by the above `Set<? extends Node.NodeType>`, even though the underlying implementation is typically the very compact and efficient `EnumSet<MyToken.TokenType>`.
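Concretely, the wildcard is what lets one parameter accept token-type sets coming from different generated parsers while each of them remains an ordinary `EnumSet` underneath. The `NodeType` marker interface and the two enums in this sketch are illustrative stand-ins:

```java
import java.util.EnumSet;
import java.util.Set;

class WildcardDemo {
    interface NodeType {}   // stand-in for Node.NodeType

    // Token types from two hypothetical, separately generated parsers.
    enum PyTok implements NodeType { NAME, NUMBER, NEWLINE }
    enum CsTok implements NodeType { IDENTIFIER, KEYWORD, SEMICOLON }

    // One signature that works for token-type sets from any generated parser.
    static boolean expects(Set<? extends NodeType> expectedTypes, NodeType actual) {
        return expectedTypes.contains(actual);
    }

    public static void main(String[] args) {
        // Each argument is a plain EnumSet, so contains() stays a bit test.
        System.out.println(expects(EnumSet.of(PyTok.NAME, PyTok.NUMBER), PyTok.NAME)); // true
        System.out.println(expects(EnumSet.of(CsTok.KEYWORD), CsTok.SEMICOLON));       // false
    }
}
```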
Well, I'll be totally honest here and say that all of this has been quite an intellectual adventure, because, honestly, I didn't really fully understand Java generics. I don't mean to say that I didn't understand them at all, nor do I mean to say that I have a perfect understanding even now, but this round of work on generalizing the node/token machinery definitely led to a much deeper understanding than I had before. Oh, here is another little point. You would think that you could parametrize `ParseException`, making it something like `ParseException<T extends Node.NodeType>`, and then, when you instantiate one, it could be `new ParseException<MyToken.TokenType>(...)`, but that doesn't work, because subclasses of `Throwable` cannot be generic. It's not so hard to understand why: generic type information only exists at compile time, so at run time there would be no way to distinguish the differently parametrized exception types (in a catch clause, say). But an exception can contain fields whose types are themselves parametrized. So the constructor for `ParseException` can take an argument `expectedTypes` like this:
ParseException(Set<? extends Node.NodeType> expectedTypes, ...)
And you can pass in a first-set variable, which, as things stand, is an instance of `EnumSet<MyToken.TokenType>`. That is all type compatible, and it also means that when we write `expectedTypes.contains(someType)`, it is the super compact and efficient `EnumSet` implementation doing the work.
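Put together, the shape is roughly the following sketch; it reuses an illustrative `NodeType` marker interface and a made-up enum rather than the real generated types:

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-ins for the real Node.NodeType and a generated TokenType enum.
interface NodeType {}
enum Tok implements NodeType { IDENTIFIER, NUMBER, EOF }

// A Throwable subclass cannot be generic, but it can carry a wildcard-typed field.
class ParseExceptionSketch extends RuntimeException {
    private final Set<? extends NodeType> expectedTypes;

    ParseExceptionSketch(Set<? extends NodeType> expectedTypes, String message) {
        super(message);
        this.expectedTypes = expectedTypes;
    }

    Set<? extends NodeType> getExpectedTypes() { return expectedTypes; }
}

class ParseExceptionDemo {
    public static void main(String[] args) {
        ParseExceptionSketch ex = new ParseExceptionSketch(
                EnumSet.of(Tok.IDENTIFIER, Tok.NUMBER), "Unexpected token");
        // The contains() call below is still the bit-set-backed EnumSet doing the work.
        System.out.println(ex.getExpectedTypes().contains(Tok.EOF)); // false
    }
}
```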
Well, I'll close this message here. All this refactoring should not really impose much of a transition cost (if any). Any API in the base `Node` that took a `Token.TokenType` now takes a `Node.NodeType`, but since that is assignment compatible, existing code should just continue to work for the most part. The same applies to the `TokenSource` API, which only uses `Node.TerminalNode`, not any concrete `Token` type.
In case you did not realize it, I am quite proud of this latest round of work, because I feel I did manage to refactor everything to allow polyglot projects but in a way that is pretty much entirely non-disruptive to existing users. In fact, at the outset, I was unsure that this would even be possible!
P.S. Oh, I should mention that all of the above only works if you are generating code in Java. It does not work for C# or Python. (Yet.)