JavaCC has a long-standing issue that, admittedly, would not affect most users: it is very clumsy to have a single project with multiple JavaCC-generated parsers.
Well, probably most software projects that make heavy use of JavaCC maintain a single grammar. For example, the project that I devoted a lot of time to back in the 2000's basically revolved around a single big JavaCC grammar, for the FreeMarker template language. To this day, there are various other projects like that, such as JSQLParser, which revolves around a single humongous JavaCC (or, more precisely, JJTree) grammar of over 6000 lines. (The project maintainers apparently see no benefit to having an INCLUDE directive. Shrug.) In any case, that is probably the more common case. Even a project that maintains multiple JavaCC grammars may have no need to combine the resulting ASTs.
Now, actually, to be precise about this, I guess it is possible to post-edit the generated XXXNode classes so that they extend a common Node API. And it is also true that, since the legacy JJTree is based on the (rather horrible) anti-pattern of post-editing generated code, maybe this was just never considered a problem anyway. However, JavaCC 21 (even back in 2008, when it was called FreeCC) was based on the firm belief that post-editing generated files was a no-no!
Now, as regards the JavaCC project (both the legacy project and JavaCC 21), the approach used was simply to make the JavaCC grammar a superset of the Java grammar. Actually, as long as one is only supporting one output language, this is a quite pragmatic approach -- just reuse the syntactical constructs defined for Java itself. For example, about a year ago, I implemented an ASSERT directive in JavaCC 21. There are two kinds of assertions: one expressed in Java code, and one expressed as a lookahead expansion. As for the first kind, the relevant part of the grammar looks like this:
"ASSERT"
(
"{"
Expression
{CURRENT_NODE.setAssertionExpression((Expression) peekNode());}
"}"
["#" {CURRENT_NODE.setSemanticLookaheadNested(true);}]
But actually, leaving aside the fact that the {...} can be followed by # to indicate that it applies inside a lookahead, the assertion that is based on Java code could really be expressed as:
Assertion : "ASSERT" "{" Expression "}" ;
We can just re-use the definition of an Expression that comes from the Java grammar. And, really, that is fine and dandy as long as you don't intend to support code actions/insertions in any language other than Java. But once you do want to support other languages, things obviously get more complicated.
For a long while, I figured that we would need to extend this approach to be aware of other languages, so ultimately you would end up having something like:
Assertion : "ASSERT" "{" (JavaExpression | PythonExpression | CSharpExpression) "}"
And then we would just INCLUDE the grammars for the various languages, like:
INCLUDE "Java.ccc"
INCLUDE "Python.ccc"
INCLUDE "CSharp.ccc"
And then the CongoCCParser that results from this ends up being this monolithic thing that includes within it the ability to parse all three languages. However, this does lead to other problems because of the lack of any namespaces, i.e. we have no way of ensuring that the things in one grammar do not clobber things in another grammar. In fact, grammars for any language are bound to contain productions named Expression or Function or whatever. Or they will contain token definitions named IDENTIFIER or NAME, for example. So either one very consciously names these things a bit differently in each grammar, or one builds in some logic whereby the Expression production defined in the Java grammar is distinguished by some sort of namespace facility.
In any case, we could call that the monolithic approach. The CongoCC grammar ends up being this monstrous thing that just includes all the supported languages. (At least we have INCLUDE!) With the monolithic approach, attacking the namespace problem probably becomes very necessary. And then one ends up having to have an IMPORT where you say:
IMPORT "CSharp.ccc" in CSharp
IMPORT "Python.ccc" in Python
where CSharp and Python are namespaces. And then you can distinguish CSharp.Expression from Python.Expression and so on.
Another approach would be more of a preprocessor approach, so you have something more like:
#if PYTHON
INCLUDE "Python.ccc"
#elif CSHARP
INCLUDE "CSharp.ccc"
#else
INCLUDE "Java.ccc"
#endif
And you end up generating three different CongoCC parsers that include the respective grammars, based on which preprocessor symbol you built with.
Well, time for a plot spoiler. I finally rejected both of the above approaches as being unwieldy. I mean, okay, with just two or now three supported languages, well maybe... but if you look ahead to having 5 or even 10 languages... I'm pretty sure neither approach can really scale very well.
I think the approach to be taken will be that, within the CongoCC grammar, we'll just deal with snippets from other languages (at least, if they are NOT Java!) at the lexical level, so I anticipate having:
<CODE_SNIPPET : "${" (~["}"] | ("}" ~["$"]))* "}$" >
That is my current idea. Any code snippet is just a (potentially quite big) token of the form ${...}$. This, obviously, is based on the heuristic that the sequence } followed by $ used to terminate the token hardly ever occurs in actual code in just about any language. And I was always thinking that if, by some chance, it occasionally does, we could also allow snippets of the form {$...$}. Basically, what is needed is for this token to be terminated by some string that just about never occurs in the target language. As far as I can tell, the dollar sign $ simply cannot occur in regular Python code, though it can appear in a comment or string literal, of course. But the precise sequence }$ is hardly ever going to occur, even in a comment or string literal, in Python or just about any other language. And if, somehow, it does, you could use the alternative form that terminates with $}.
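To make the heuristic concrete, here is a minimal sketch in plain Java of what the lexer effectively does: after "${", everything up to the first occurrence of the two-character terminator "}$" belongs to the snippet, regardless of what language it is written in. (This helper is purely illustrative; it is not part of CongoCC.)

static String matchCodeSnippet(String input, int start) {
    // A CODE_SNIPPET must begin with the opening delimiter "${".
    if (!input.startsWith("${", start)) return null;
    // Scan for the first occurrence of the closing delimiter "}$".
    int end = input.indexOf("}$", start + 2);
    if (end < 0) return null; // unterminated snippet
    // Return the whole token, including the "${" and "}$" delimiters.
    return input.substring(start, end + 2);
}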
Of course, if the CongoCC parser makes not the slightest attempt to parse what is between ${ and }$, then that parsing will eventually have to be delegated, at a later stage, to a parser that can handle it. So the above definition of an assertion ends up looking like:
Assertion : "ASSERT" "{" (Expression | <CODE_SNIPPET>) "}" ;
(I anticipate that, for now, the CongoCCParser will continue to INCLUDE the Java grammar, but if you want to inject code in another language, you just use the CODE_SNIPPET token, where the code snippet is clearly delimited.)
Then, at a later stage, one could envisage that we just run over the tree and replace the CODE_SNIPPET tokens with the subtree that results from parsing each code snippet in its own language.
EXCEPT... there is another little problem here, which is that if we simply package the Python and CSharp parsers inside the tool so that it can parse the CODE_SNIPPET tokens, we can't (at the moment) do the most natural thing, which is just to have a sort of single-rooted "über" tree in which we have replaced the CODE_SNIPPET tokens with the appropriate subtrees.
Why?
Because the root Node object in the Python AST is not the same as in the CSharp AST or the CongoCC AST. If you have different parsers generated from different grammars, an AST generated by one grammar cannot be added as a subtree to one generated by another grammar, because something that is an org.foo.Node cannot have an org.bar.Node added to it as a "child". We really need a way of having a polyglot project in which all the various Node objects for the different languages share a common base Node API.
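To sketch what I mean (with purely hypothetical names -- this is not CongoCC's actual generated API), the idea is that every generated Node type would implement one shared base interface, so that a subtree produced by one parser can legally be attached as a child in a tree produced by another:

// Shared base API, living in some common package that all the parsers know about.
interface BaseNode {
    void addChild(BaseNode child);
    BaseNode getChild(int index);
    int getChildCount();
}

// Each language's generated Node type would then implement this interface,
// e.g. both org.foo.Node and org.bar.Node implement BaseNode, so an
// org.bar.Node subtree can be added as a child of an org.foo.Node.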
Also (though it's a smaller problem) we have a separate ParseException class generated by each grammar. That is liable to be a PITA too. And, you know, we have methods in Node.java like:
Token firstDescendantOfType(TokenType type) {
    .....
}
But, of course, in a polyglot project, we have several different TokenType enums floating around. Well, the solution would have to be that the various TokenType enums all implement a common interface, and APIs like the above use that interface in their signature. (I recently checked, and yes, an enum can implement an interface! I didn't know that before!)
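As a sketch of that (again with hypothetical names), the various generated TokenType enums would all implement one shared interface, and the shared APIs would take the interface type:

// The common interface that every generated TokenType enum implements.
interface BaseTokenType {
    String name(); // every enum already inherits this from java.lang.Enum
}

// Two hypothetical generated enums, one per grammar:
enum JavaTokenType implements BaseTokenType { IDENTIFIER, LBRACE, RBRACE }
enum PythonTokenType implements BaseTokenType { NAME, INDENT, DEDENT }

// And the shared Node API would use the interface in its signature:
// Token firstDescendantOfType(BaseTokenType type) {...}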
Well, a polyglot project does have this basic problem of several different grammars needing to share a common base Node API, but it looks resolvable. Right now, you can define:
PARSER_PACKAGE = org.foolang.parser;
NODE_PACKAGE = org.foolang.ast;
But it seems we need an additional setting, something like:
BASE_API_PACKAGE = org.foolang.baseapi;
which says where to put the base Node and probably ParseException and whatever else needs to be shared between the different ASTs -- a common root node type and other common API, like a shared ParseException. Of course, that is for a "polyglot" project. Existing projects that don't need any of that should basically be unaffected.
Well, I write all of the above perhaps more for my own benefit than for that of anybody else. The act of writing can help one clarify one's thinking. For now, a high priority is having a kind of lint program that runs over any JavaCC grammar (legacy or JavaCC 21) and outputs a CongoCC grammar, so that all the legacy syntax is translated to the newer, streamlined syntax. This will allow the CongoCC.ccc grammar to be much cleaner, a better basis for moving forward. At some point, I guess the JavaCC 21 repository will just be frozen, and anybody can download and use the last version under that naming. We could even put it on Maven Central, and all the people obsessed with concepts such as stability and reproducibility can use it, I guess.
(Since it is that season again, I found myself thinking about a version of the Charles Dickens story "A Christmas Carol", in which Scrooge is visited by the various bugs of Christmas past...)