Greetings,
The legacy JavaCC (or really the tree-building add-on JJTree) suffers from a longstanding problem. It does not at all contemplate the problems of multi-language (a.k.a. polyglot) projects. Basically, if you have two (or more) JavaCC grammars in your project, each one will generate its own `Token` class, its own `Node` interface, and so on, and it seems that nobody really envisaged any need for these generated classes to interoperate at all. This problem was also present in its successor, JavaCC 21, but I am happy to report that it is resolved in CongoCC.
Now admittedly, the existing state of affairs was fine and dandy as long as these things were parts of completely separate subsystems that had nothing to do with one another. In fact, arguably, in that case it is even a good thing! Certainly, the fact that the tool generates code that is totally self-contained, with no dependencies other than the core Java library, is often thought of as a strong point.
However, there are surely many cases where the natural way to model the problem is to generate a single AST, even though different parts were generated by different parsers. For example, webpages are in HTML, but they may also contain snippets of other languages inside, like CSS or JavaScript. One can even think of Javadoc comments in a Java source file as an embedded mini-language. Actually, come to think of it, in JavaCC 21 the situation is even more gnarly, because the core `Token` construct was retrofitted to extend `Node` precisely so that the tokens could be terminal nodes in the tree. However, there was no provision for `Token` types generated by different grammars to co-exist in the same tree. This is not only for the aforementioned reason that the different `Token` types have different root `Node` APIs, but also because each `Token` has its own `TokenType` enum -- and, well, these things don't interoperate either!
This whole problem is now basically solved in CongoCC. (Though there may be some rough edges to be rounded out over the coming short while...) Right now I'll outline the general solution:
The `ROOT_API_PACKAGE` setting
The new `ROOT_API_PACKAGE` setting is really the linchpin. When we set this, it means that we are not going to generate a `Node` interface from this grammar, but rather, we are going to reuse the base `Node` API that was generated by another grammar. You can see that being used here and here also. So, for example, we have:
PARSER_PACKAGE=org.congocc.parser.python;
ROOT_API_PACKAGE=org.congocc.parser;
This means that (as before) we generate the `PythonParser` in the package set by `PARSER_PACKAGE`, which is specified here as `org.congocc.parser.python`, BUT the other (new) setting, `ROOT_API_PACKAGE`, says that we are going to re-use the base API generated for the overall CongoCC parser. Similarly, the `CSharpInternal.ccc` file contains the analogous lines:
PARSER_PACKAGE=org.congocc.parser.csharp;
ROOT_API_PACKAGE=org.congocc.parser;
What this means in both these cases is that all the generated `Node` and `Token` types in the system end up extending the common `org.congocc.parser.Node` API. Or, in other words, the Python and C# parsers generated for internal use inside the CongoCC tool itself produce trees that can be added to the overall tree for a CongoCC grammar file, even though they were actually generated by separate parsers from separate grammars.
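To make that concrete, here is a minimal, purely illustrative sketch of what sharing a root API buys you. None of the class names below are the actual generated code; they are stand-ins. The point is simply that once every generated node type implements the same root `Node` interface, a subtree produced by one parser can be attached directly to a tree produced by another.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the shared root API that ROOT_API_PACKAGE points at.
interface Node {
    void addChild(Node child);
    List<Node> children();
}

// Hypothetical node produced by the CongoCC grammar's own parser.
class GrammarFileNode implements Node {
    private final List<Node> children = new ArrayList<>();
    public void addChild(Node child) { children.add(child); }
    public List<Node> children() { return children; }
}

// Hypothetical node produced by the separately generated internal Python parser.
class PythonModuleNode implements Node {
    private final List<Node> children = new ArrayList<>();
    public void addChild(Node child) { children.add(child); }
    public List<Node> children() { return children; }
}

class PolyglotTreeSketch {
    public static void main(String[] args) {
        Node root = new GrammarFileNode();
        // Legal precisely because both node types share the same root Node API:
        root.addChild(new PythonModuleNode());
        System.out.println(root.children().size()); // prints 1
    }
}
```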
So, in the specific case of the CongoCC grammar itself, the approach taken is to define a separate node type called `UnparsedCodeBlock`, and you can see the implementation here. Any code that is not parsed by the CongoCC parser is taken to be in an `UnparsedCodeBlock` and, on that level, the content is just dealt with lexically -- i.e. we just scan forward looking for the special string (which is `$}` in this case) to end the unparsed content. In other words, the `UnparsedCodeBlock` starts with `{$` and ends with `$}`, on the assumption that it is very uncommon for the terminating sequence `$}` to occur in any embedded source code. (Is that assumption wrong?) So when we parse `Foo.ccc`, the unparsed content is ignored and left for a second pass. In fact, the subtree in the embedded language that the `UnparsedCodeBlock` contains can be created lazily and added to the tree in a second pass. (Note also that if the code in that block turns out to be syntactically invalid in whatever embedded language, that does not prevent the rest of the grammar file from being parsed or the construction of the overall tree. We just end up with a node that contains invalid code.)
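Just to illustrate what "dealt with lexically" amounts to, here is a tiny self-contained sketch, not the actual CongoCC lexer code, of scanning forward to the `$}` terminator:

```java
public class UnparsedBlockScan {

    /**
     * Given the full source text and the offset just past an opening "{$",
     * return the raw (unparsed) content up to, but not including, the closing "$}".
     * Purely illustrative; the real lexer does this inside its tokenization loop.
     */
    static String scanUnparsedContent(String source, int offsetAfterOpen) {
        int end = source.indexOf("$}", offsetAfterOpen);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated {$ ... $} block");
        }
        return source.substring(offsetAfterOpen, end);
    }

    public static void main(String[] args) {
        String text = "stuff before {$ any embedded code here $} stuff after";
        int offsetAfterOpen = text.indexOf("{$") + 2;
        System.out.println(scanUnparsedContent(text, offsetAfterOpen));
        // prints: " any embedded code here "
    }
}
```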
I think the above describes the essence of the situation and is probably enough to get going with this, assuming you need this feature yourself. There are some more detailed aspects of all this that you may not need to even know -- certainly not initially -- but if you are curious, by all means read on, though it is optional.
Dealing with Multiple Token Types
Of course, the devil is in the details, and certain additional refactorings were necessary to get this all working. One technical hurdle was that the different parsers still have their own separately generated `Token` and `Token.TokenType`. Once we envisage sets (or lists or streams...) of tokens that are heterogeneous, i.e. they were generated by different parsers and thus have different `TokenType` enums, we see that we really need a way to refer to these things with a common base API. So, you will note that the generated `Token` class and `TokenType` enum now implement two new interfaces, `Node.TerminalNode` and `Node.NodeType` respectively. So, one very significant aspect of this refactoring is that the base `Node` interface, to be generally reusable in a polyglot setting, needs to refer to any tokens and token types (that potentially come from other parser subprojects) exclusively via those root interfaces.
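In outline, the pattern looks something like the following sketch. The member lists are stripped down to the bare minimum and should be read as assumptions, not as the actual generated signatures:

```java
// Bare-bones stand-in for the shared root API.
interface Node {
    // Implemented by every generated TokenType enum.
    interface NodeType {
        String name();   // any enum constant already provides this
    }
    // Implemented by every generated Token class.
    interface TerminalNode extends Node {
        NodeType getType();
    }
}

// What one generated parser might contribute (names are illustrative):
enum PyTokenType implements Node.NodeType { NAME, NUMBER, NEWLINE, INDENT, DEDENT }

class PyToken implements Node.TerminalNode {
    private final PyTokenType type;
    PyToken(PyTokenType type) { this.type = type; }
    // The concrete enum constant is exposed only through the root interface.
    public Node.NodeType getType() { return type; }
}
```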
In JavaCC 21, the generated `XXXLexer` object really combined two different functionalities: the actual tokenization (the NFA loop) and also a kind of file map that kept track of starting line positions and such. It became obvious (while implementing this) that there was a need to be able to refer to multiple `XXXLexer` classes via a common API, so, in Congo, the file-map/location functionality is broken out into a separate abstract base class, `TokenSource`, from which all the `XXXLexer` classes descend. Thus, in a polyglot project, the various lexer objects can all be referred to via a common `TokenSource` API. Note also that the `TokenSource` API only refers to `Node.TerminalNode` and `Node.NodeType`, never to the concrete implementations `Token` and `Token.TokenType`. But again, all this refactoring is basically non-disruptive to existing users, in this case because the more abstract API is assignment compatible with the concrete implementations -- i.e. a `Node.NodeType` variable can be assigned any member of whatever `TokenType` enums are generated.
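As a rough sketch of that shape, continuing with the stand-in `Node.NodeType` and `Node.TerminalNode` interfaces and the `PyToken`/`PyTokenType` types from the previous sketch (the `tokenAt` method here is an invented illustration, not the real `TokenSource` API):

```java
// The common base class only ever speaks in terms of the root interfaces.
abstract class TokenSourceSketch {
    // File-map/location bookkeeping would live here; token access is expressed
    // via Node.TerminalNode, never via any concrete generated Token class.
    abstract Node.TerminalNode tokenAt(int offset);
}

class AssignmentCompatibilityDemo {
    static void demo(PyToken someToken) {
        // Assignment compatibility: a concrete enum constant is-a Node.NodeType,
        // and a concrete token is-a Node.TerminalNode.
        Node.NodeType type = PyTokenType.NAME;
        Node.TerminalNode terminal = someToken;
        System.out.println(type.name() + " / " + terminal.getType().name());
    }
}
```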
Generalization of `TokenType`
The case of `Node.NodeType` is actually technically interesting (at least I find it so) because it relies on the fact that a Java enum, though it cannot extend another class or be subclassed, can still implement an interface. (While legacy JavaCC simply used `static final int` constants to define the token types, JavaCC 21 used type-safe enums.) And that led to the use of `java.util.EnumSet` to represent sets of token types, such as a first set, which is the set of token types that can begin a production. The use of `EnumSet`, by the way, is very (I mean VERY) computationally efficient -- particularly if the enum type has 64 elements or fewer, because in that case the information about which elements are in the set is held in a single primitive `long` variable, and checking whether an `EnumSet` contains a given element boils down to checking whether a given bit is set in that variable. There is a little more overhead if the enum has more than 64 elements, because then the information is stored internally in a `long[]` array rather than a single `long`, which means a bit more storage and an extra level of indirection (accessing a member of an array), but even then it is surely pretty close to being free. In short, I do like `EnumSet`, because it is both very notationally convenient AND extremely computationally efficient. (What is there not to like?)
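For anyone who has not used `EnumSet`, here is roughly what that looks like in practice; the enum below is just an illustrative stand-in for a generated `TokenType`:

```java
import java.util.EnumSet;

class FirstSetDemo {
    // Illustrative stand-in for a generated TokenType enum.
    enum TokType { IDENTIFIER, NUMBER, LPAREN, MINUS, STRING }

    public static void main(String[] args) {
        // e.g. the first set of some hypothetical "PrimaryExpression" production
        EnumSet<TokType> firstSet =
                EnumSet.of(TokType.IDENTIFIER, TokType.NUMBER, TokType.LPAREN, TokType.MINUS);

        // With 64 or fewer enum constants, the set is backed by a single long,
        // so contains() is essentially one bit test.
        System.out.println(firstSet.contains(TokType.NUMBER)); // true
        System.out.println(firstSet.contains(TokType.STRING)); // false
    }
}
```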
The problem is that a potentially heterogeneous set of these enum types cannot be held in a single `EnumSet<TokenType>` instance. However, such sets can still be referred to from a common API. For example, if you have:
Set<? extends Node.NodeType> expectedTypes;
that is assignable from any `EnumSet<TokenType>` instance. Thus, for example, to generalize the API for error recovery and such, we could scan forward and look for a token whose type is contained in a set. And that can be expressed by the above `Set<? extends Node.NodeType>`, even though the underlying implementation is typically the very compact and efficient `EnumSet<MyToken.TokenType>`.
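Concretely, the wildcard is what lets one parameter accept token-type sets coming from different generated parsers while each of them remains an ordinary `EnumSet` underneath. The `NodeType` marker interface and the two enums in this sketch are illustrative stand-ins:

```java
import java.util.EnumSet;
import java.util.Set;

class WildcardDemo {
    interface NodeType {}   // stand-in for Node.NodeType

    // Token types from two hypothetical, separately generated parsers.
    enum PyTok implements NodeType { NAME, NUMBER, NEWLINE }
    enum CsTok implements NodeType { IDENTIFIER, KEYWORD, SEMICOLON }

    // One signature that works for token-type sets from any generated parser.
    static boolean expects(Set<? extends NodeType> expectedTypes, NodeType actual) {
        return expectedTypes.contains(actual);
    }

    public static void main(String[] args) {
        // Each argument is a plain EnumSet, so contains() stays a bit test.
        System.out.println(expects(EnumSet.of(PyTok.NAME, PyTok.NUMBER), PyTok.NAME)); // true
        System.out.println(expects(EnumSet.of(CsTok.KEYWORD), CsTok.SEMICOLON));       // false
    }
}
```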
Well, I'll be totally honest here and say that all of this has been quite an intellectual adventure, because, honestly, I didn't really fully understand Java generics. I don't mean to say that I didn't understand them at all, nor do I mean to say that I have a perfect understanding even now, but this round of work on generalizing the node/token machinery definitely led to a much deeper understanding than I had before. Oh, here is another little point. You would think that you could parametrize `ParseException`, making it something like `ParseException<T extends Node.NodeType>`, and then, when you instantiate one, it could be `new ParseException<MyToken.TokenType>(...)`, but that doesn't work, because subclasses of `Throwable` cannot be generic. It's not so hard to understand why: generic type information only exists at compile time, so at run time there would be no way to distinguish the differently parametrized exception types (in a catch clause, say). But an exception can contain fields whose types are themselves parametrized. So the constructor for `ParseException` can take an argument `expectedTypes` like this:
ParseException(Set<? extends Node.NodeType> expectedTypes, ...)
And you can pass in a first-set variable, which, as things stand, is an instance of `EnumSet<MyToken.TokenType>`. That is all type compatible, and it also means that when we write `expectedTypes.contains(someType)`, it is the super compact and efficient `EnumSet` implementation doing the work.
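Put together, the shape is roughly the following sketch; it reuses an illustrative `NodeType` marker interface and a made-up enum rather than the real generated types:

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-ins for the real Node.NodeType and a generated TokenType enum.
interface NodeType {}
enum Tok implements NodeType { IDENTIFIER, NUMBER, EOF }

// A Throwable subclass cannot be generic, but it can carry a wildcard-typed field.
class ParseExceptionSketch extends RuntimeException {
    private final Set<? extends NodeType> expectedTypes;

    ParseExceptionSketch(Set<? extends NodeType> expectedTypes, String message) {
        super(message);
        this.expectedTypes = expectedTypes;
    }

    Set<? extends NodeType> getExpectedTypes() { return expectedTypes; }
}

class ParseExceptionDemo {
    public static void main(String[] args) {
        ParseExceptionSketch ex = new ParseExceptionSketch(
                EnumSet.of(Tok.IDENTIFIER, Tok.NUMBER), "Unexpected token");
        // The contains() call below is still the bit-set-backed EnumSet doing the work.
        System.out.println(ex.getExpectedTypes().contains(Tok.EOF)); // false
    }
}
```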
Well, I'll close this message here. All this refactoring should not really impose much of a transition cost (if any). Any API in the base `Node` that took a `Token.TokenType` now takes a `Node.NodeType`, but since that is assignment compatible, existing code should just continue to work for the most part. The same applies to the `TokenSource` API, which only uses `Node.TerminalNode`, not any concrete `Token` type.
In case you did not realize it, I am quite proud of this latest round of work, because I feel I did manage to refactor everything to allow polyglot projects but in a way that is pretty much entirely non-disruptive to existing users. In fact, at the outset, I was unsure that this would even be possible!
P.S. Oh, I should mention that all of the above only works if you are generating code in Java. It does not work for C# or Python. (Yet.)