Hi Robert. Thanks for the question, though I'm a bit confused by it. The straight answer is that CongoCC creates two packages: one for the (relatively few) parser classes (and base classes like Node.java and Token.java) and another for the various node subclasses in your syntax tree. So, yes, with this layout, the node package would contain all the various node subtypes in your AST, both terminal (a.k.a. tokens) and non-terminal nodes. That could be a fair number, but I have a hard time conceiving of it being many thousands. As a concrete example, the Java grammar generates 162 classes in the node package, which is a fair number, I suppose, but nothing like thousands or tens of thousands.
Well, one thing to note is that, by default, it will generate a different subclass for each token type, so you could have a different token subclass for `if`, `for`, `switch`, etc. But the Java grammar actually just generates one token subtype for all of the various keywords, which is `KeyWord`. You can see where that is specified here. So I suppose you would want to do something similar for the various keywords in SQL. As for non-terminal nodes, you would have things like `SelectStatement` and `CreateStatement` and `DeleteStatement` and so on. Okay, it adds up, but I honestly don't see how you get to many thousands of node types.
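To make the keyword point concrete, here is a rough Java sketch of the idea. It is not the code CongoCC actually generates: the names `KeywordTokenSketch`, `TokenType`, `Identifier` and `newToken` are made up for illustration, and only `Token` and `KeyWord` echo names mentioned above. It just shows how one shared subclass can cover every keyword, with the token type telling them apart.

```java
import java.util.List;

public class KeywordTokenSketch {
    // Hypothetical token types; a real SQL grammar would define many more.
    enum TokenType { IF, FOR, SWITCH, IDENTIFIER }

    // Simplified stand-in for a generated base Token class.
    static class Token {
        private final TokenType type;
        private final String image;

        Token(TokenType type, String image) {
            this.type = type;
            this.image = image;
        }

        TokenType getType() { return type; }
        String getImage() { return image; }
    }

    // One subclass shared by every keyword, instead of If, For, Switch, ...
    static class KeyWord extends Token {
        KeyWord(TokenType type, String image) { super(type, image); }
    }

    // A non-keyword token keeps its own subclass.
    static class Identifier extends Token {
        Identifier(String image) { super(TokenType.IDENTIFIER, image); }
    }

    // Assumed factory: maps a token type to the class used to represent it.
    static Token newToken(TokenType type, String image) {
        switch (type) {
            case IF:
            case FOR:
            case SWITCH:
                return new KeyWord(type, image);
            default:
                return new Identifier(image);
        }
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            newToken(TokenType.IF, "if"),
            newToken(TokenType.FOR, "for"),
            newToken(TokenType.IDENTIFIER, "total"));
        for (Token t : tokens) {
            System.out.println(t.getImage() + " -> " + t.getClass().getSimpleName());
        }
    }
}
```

With this shape, adding another keyword adds an enum constant rather than another class, which is why the keyword count alone should not inflate the node package.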
So, anyway, the node classes that are generated are all in the same package. Now, here is a somewhat related point. If you want to maintain a node class by hand, in another package, you can do that by specifying the fully qualified class name in the grammar. You can see an example of that here. You see, what happens sometimes is that when you inject a lot of code into a node class, it becomes unwieldy to keep it in the grammar file and you prefer to maintain it in a source file in a separate package. That way, you can edit it in your IDE and have the full Java tooling, which, unfortunately, you don't have when you edit code snippets in a grammar file. In practice, most of the code you inject into a node is very short getter/setter sort of stuff, so it is usually more convenient to have it in the grammar near where the relevant grammar rule is specified. But once you end up injecting very large amounts of code, it becomes more convenient to maintain the file separately.
Well, that's a bit of a separate issue, but I thought I'd mention it. Anyway, maybe look at this carefully. Do you really have many thousands of node types getting generated?