I'm just starting down the road of a grammar for ANSI SQL 2016. I expect that this will generate thousands (and potentially tens of thousands) of entries in the ast folder. Is there a way to create a package-based tree?

Hi Robert. Thanks for the question. I'm a bit confused by it though. The straight answer is that CongoCC creates two packages, one package for the (relatively few) parser classes (and base classes like Node.java and Token.java) and another package for the various node subclasses in your syntax tree. So, yes, in this arrangement, the node package would contain all the various node subtypes in your AST, both terminal (a.k.a. tokens) and non-terminal nodes. That could be a fair number, but I have a hard time conceiving of it being many thousands. As a concrete example, the Java grammar generates 162 classes in the node package, which is a fair number, I suppose, but nothing like thousands or tens of thousands.

Well, one thing to note is that, by default it will generate a different subclass for each token type, so you could have a different token subclass for if, for, switch etc. But the Java grammar actually just generates one token subtype for all of the various keywords, which is KeyWord. You can see where that is specified here. So I suppose you would want to do something similar for the various keywords in SQL. As for non-terminal nodes, you would have things like SelectStatement and CreateStatement and DeleteStatement and so on. Okay, it adds up, but I honestly don't see how you get to many thousands of node types.

So, anyway, the node classes that are generated are all in the same package. Now, here is a somewhat related point. If you want to maintain a node class by hand, in another package, you can do that by specifying the fully qualified class name in the grammar. You can see an example of that here. You see, what happens sometimes is that when you inject a lot of code into a node class, it becomes unwieldy to keep it in the grammar file and you prefer to maintain it in a source file in a separate package. That way, you can edit it in your IDE and have the full Java tooling, which, unfortunately, you don't have when you edit code snippets in a grammar file. But, in practice, most of the code you inject into a node is very short getter/setter sort of stuff so, on balance, it is more convenient to have it in the grammar near where the relevant grammar rule is specified. Once you end up having very large amounts of code injected, though, it is more convenient to maintain the file separately.

Well, that's a bit of a separate issue, but I thought to mention it. But, anyway, maybe look at this carefully. Do you really have many thousands of node types getting generated?

Yeah. I started this project about 7 years ago when I was between jobs using "Legacy JavaCC", and the productions were almost 500 before I got another job and stopped working on it.

/cygdrive/c/apps/MyParser/src/main/java/com/flybyday/sql
$ find . -type f | grep \.java | wc -l
485

Yeah, it's a beast, which is why I suspect you don't see too many in the wild. This is just for the "core" functionality (create/drop/update and select/insert/delete). I never got around to functions and procedures, grants, revokes, or many of the other options listed in the 1529 page specification.

I've been trying to convert some of my old jjt files, so I don't have to re-invent the wheel, and I was able to successfully get through section 5.1 <SQL terminal character>, but I've hit a snag in 5.2 <token> and <separator>. Any hints as to what might cause an NPE in the NFA Builder?

Exception in thread "main" java.lang.NullPointerException
at org.congocc.core.nfa.NfaBuilder.visit(NfaBuilder.java:96)

It's a rather busy section, and I've tried commenting out bits of it without much luck.

    I see you even made every keyword a token. That's 560 java classes right there.

      Robert-Egan Exception in thread "main" java.lang.NullPointerException
      at org.congocc.core.nfa.NfaBuilder.visit(NfaBuilder.java:96)

      Hmm, I've never seen this problem before. My best guess is that you have some problem with your lexical specification, a malformed regexp or something, and that is causing the NPE.

      That said, this would definitely be a bug in CongoCC, since it should provide a readable error message in such a case, not an NPE. If you would just share your grammar, even just the lexical part that defines the tokens, I'd be glad to have a look.

      Oh, and more generally, I was wondering what your plans for your SQL grammar are. Is it open-source? Or...

      Robert-Egan I see you even made every keyword a token. That's 560 java classes right there.

      Well, generating a separate class for every keyword is not actually necessary. It does that by default but you can simply give a class name at the top of the token production and it uses that class for all the tokens in there. That is what the #KeyWord here means. You can see that it does not generate a separate Token subclass for every one of those keywords, just one KeyWord.java subclass.

      But the thing is also that it is not clear why this would be a practical problem anyway. The main practical reason to want separate packages would be to avoid naming clashes. But really, there is no reason there should be any naming clashes. Really, at the end of the day, this is all generated code, so the fact that you end up with a node package with a lot of generated classes, this probably is not a real problem -- not any more than the fact that your XXXParser.java is a humongous file that you would never write by hand. I mean, all of this is really in the nature of using a code generator sort of tool. It will generate code that you would never write by hand, but that is not typically much of a problem.

        You are right. The days of file limits are long gone, and there should never be a name clash in the entire grammar. It is simply an aesthetic issue for me. I prefer a deep folder structure as opposed to a wide one.

        I plan to publish the entire thing in github someday, free for everybody, because every time I look I never find a complete grammar. People tend to focus on the parts they care about and once they achieve that they stop building. Also, you have to pay ANSI for the "official" grammar, and few people want to do that. I did, because I got frustrated with all the websites that publish only the grammar (your site has a link to one of them), leaving you to wonder what the hell the "Syntax Rules" actually say.

        <space> ::=
        !! See the Syntax Rules.

          Robert-Egan I prefer a deep folder structure as opposed to a wide one.

          Well, that's understandable. But this is basically how JJTree worked (and works). You specify a NODE_PACKAGE setting and it puts all the generated ASTXXX.java files (assuming you use the MULTI option, which is effectively the way CongoCC works by default) in a single package. Of course, one difference is that with CongoCC, you have the ability to use INJECT to add code to the generated node classes, so typically you don't even need to bother with editing or even reading the code in the node package. It's really out of sight, out of mind for the most part.

          Anyway, if you can figure out which regular expression in the lexical part of the grammar is causing the NPE you reported, I am very interested in that. Or, if you want, just post the grammar, even the lexical part, and I'll have a look.

          8 months later

          revusky Is there a global option that has the same effect as #Keyword? Or is it required to apply the #Keyword marker to every list of tokens?

          Also I note that this is allowed:

          TOKEN [ IGNORE_CASE ] : 
              <COMMA: ",">
            | <LBRACE: "{">
            | <RBRACE: "}">
            | <LPAREN: "(">
            | <RPAREN: ")">
          etc...

          but this is not:

          TOKEN #Keyword [ IGNORE_CASE ] :  
              <COMMA: ",">
            | <LBRACE: "{">
            | <RBRACE: "}">
            | <LPAREN: "(">
            | <RPAREN: ")">
          etc...

            opeongo

            All that #Keyword does in that spot is that it specifies that the token types in this group generate a separate Token subclass called Keyword. It does not do anything besides that. If you have [ IGNORE_CASE ] in there, it would have to be in this order:

                    TOKEN [IGNORE_CASE] #Keyword :

            Oh, by the way, if you had:

             TOKEN #Keyword : 
                   <COMMA: ","> #Comma
                   etc

            then it would generate a separate subclass Comma that descends from Keyword, which in turn inherits from Token.

            So you have the ability to generate some class hierarchies pretty economically, if you want. Actually, you might prefer for these tokens to be Separator rather than Keyword maybe, but the principle is the same.
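As a rough illustration of the hierarchy described here (Comma descends from Keyword, which inherits from Token), the following self-contained Java sketch uses stand-in classes. They are not what CongoCC actually emits: the real generated Token has many more members, and text() and describe() are names made up for the example.

```java
// Stand-in classes only; text() and describe() are invented for this sketch.
final class TokenHierarchySketch {

    // Plays the role of the generated base Token class.
    static class Token {
        private final String text;
        Token(String text) { this.text = text; }
        String text() { return text; }
    }

    // One subclass for the whole group, as requested by "TOKEN #Keyword :".
    static class Keyword extends Token {
        Keyword(String text) { super(text); }
    }

    // Extra subclass for a single token type, as requested by "#Comma".
    static class Comma extends Keyword {
        Comma(String text) { super(text); }
    }

    // Most specific check first, since a Comma is also a Keyword and a Token.
    static String describe(Token t) {
        if (t instanceof Comma) return "Comma";
        if (t instanceof Keyword) return "Keyword";
        return "Token";
    }

    public static void main(String[] args) {
        System.out.println(describe(new Comma(",")));
        System.out.println(describe(new Keyword("{")));
        System.out.println(describe(new Token("x")));
    }
}
```

Code that only cares about the group can test for Keyword, while code that needs the one special case can test for Comma, without a separate class for every other token type.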

              revusky Thanks for the explanation about the order of the #Keyword and [ IGNORE_CASE ], that helps.

              It does appear that when I add the #Keyword modifier, the tokens do not generate node files. That is my goal: to not have all of these node files generated that will never be instantiated.

              If there was another way to do this I would gladly use it.

                The NODE_PACKAGE option seems to be opinionated. If I set the NODE_PACKAGE package name to be the same as the PARSER_PACKAGE package name, then it seems that an ".ast" suffix is added to the NODE_PACKAGE package name.

                Is there any way that the NODE_PACKAGE can be the same as the PARSER_PACKAGE?

                  opeongo
                  Well, I think if you just use #Token then all of the token types in the group are just instantiated as instances of the base Token class, which would basically be legacy behavior.

                  opeongo

                  Is there any way that the NODE_PACKAGE can be the same as the PARSER_PACKAGE?

                  Well, I think the answer to that is no. At some point, for certain practical reasons, I just decided that this was how it was going to work: a separate package for the core API and another for the generated nodes. Also, I got rid of the possibility of generating code into the unnamed package.

                  But it just seemed somehow cleaner and easier to be opinionated and make people have the two packages. But I'd have to ask: aside from it being different from how you did things before, does this really create any particular inconvenience?

                    revusky The inconvenience is that I have about 200 hand coded node classes that I would have to edit to change the package name. Some of them access package private methods from the parser/expression package, so that would have to change also. Nothing monumental, just a bit of churn.

                    I don't understand how congocc will work with my existing hand coded node classes. Is it even possible to have congocc use my existing node source code files?

                    Here is what I did:

                    1. defined the NODE_PACKAGE path
                    2. put all of my node source code in that folder
                    3. ran congocc to generate the parser
                    4. all my node source code files got clobbered. I was expecting that if there was a file there then it would get 'reused' not overwritten.

                    Does congocc require the INJECT option to add code into the nodes? Is this the only way? I must be missing something.

                      opeongo Does congocc require the INJECT option to add code into the nodes?

                      Well, not exactly. Using INJECT is the preferred way of doing things, yes. But you actually do have the option of maintaining Node subclasses by hand.

                      I mean, the typical and preferred usage pattern would be something like:

                        AdditiveExpression : 
                              MultiplicativeExpression ( ("+"|"-") MultiplicativeExpression )*
                        ;

                      So, the code generation is going to generate a class, AdditiveExpression, and that is in the NODE_PACKAGE, right? Furthermore, if you want to inject some code, like a custom method, into that AdditiveExpression.java source file, you use INJECT, like:

                          INJECT AdditiveExpression : 
                          implements Expression
                          {
                                public Expression getLhs() {
                                      return (Expression) get(0);
                                }
                          }

                      Okay, the above is the preferred usage pattern. It has certain advantages, like the fact that the generated source file (in this case AdditiveExpression.java) is just regenerated each time (with the injected code) so, in terms of doing a clean rebuild of the project, things are quite clean. (Generating files and then subsequently post-editing them is not a good pattern!) Also, the above INJECT would typically be put right next to the AdditiveExpression production in the grammar file itself. So the code is just easier to read, since things that are relevant to one another are located in the same place.

                      Now, all that said, the above is the preferred usage pattern. But you can maintain the AdditiveExpression.java file yourself, and if you do that, it can be in any package you want.

                      The way you would do that is by fully specifying the package name in the grammar production. So you would write:

                       AdditiveExpression #com.mypackage.AdditiveExpression :
                              MultiplicativeExpression ( ("+"|"-") MultiplicativeExpression )*
                        ;

                      When the node class is fully specified like that, it tells the system that it does not need to generate the class, you're taking care of it.
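To make the hand-maintained option a little more concrete, here is a rough, self-contained Java sketch of what such a class might look like. It is not CongoCC output: BaseNode, Operand and Operator are simplified stand-ins invented for the example, and only the get(0) call and the Expression interface come from the INJECT snippet earlier in this reply. Everything is nested in one wrapper class so it compiles as a single file; in a real project AdditiveExpression would sit in its own source file under the package named in the grammar.

```java
import java.util.ArrayList;
import java.util.List;

final class HandMaintainedNodeSketch {

    // Stand-in for the generated base class. In a real project this would come
    // from the generated parser package; add/get/size here are simplified.
    static class BaseNode {
        private final List<BaseNode> children = new ArrayList<>();
        void add(BaseNode child) { children.add(child); }
        BaseNode get(int i) { return children.get(i); }
        int size() { return children.size(); }
    }

    // Marker interface, as in the INJECT example.
    interface Expression {}

    // Hypothetical leaf nodes so the example has something to hold.
    static class Operand extends BaseNode implements Expression {
        final String text;
        Operand(String text) { this.text = text; }
    }

    static class Operator extends BaseNode {
        final String symbol;
        Operator(String symbol) { this.symbol = symbol; }
    }

    // The hand-maintained class: written and edited in the IDE, never generated.
    static class AdditiveExpression extends BaseNode implements Expression {
        Expression getLhs() { return (Expression) get(0); }
        Expression getRhs() { return (Expression) get(size() - 1); }
        String getOperator() { return ((Operator) get(1)).symbol; }
    }

    // Builds by hand a tree shaped like one a parser might produce for "a + b".
    static AdditiveExpression sample() {
        AdditiveExpression sum = new AdditiveExpression();
        sum.add(new Operand("a"));
        sum.add(new Operator("+"));
        sum.add(new Operand("b"));
        return sum;
    }

    public static void main(String[] args) {
        AdditiveExpression sum = sample();
        Operand lhs = (Operand) sum.getLhs();
        Operand rhs = (Operand) sum.getRhs();
        System.out.println(lhs.text + " " + sum.getOperator() + " " + rhs.text);
    }
}
```

The accessor bodies are the same kind of short getter you would otherwise inject; the difference is only where the source lives and who owns the file.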

                      In fact, you can see that this is used internally somewhat. See, for example: https://github.com/congo-cc/congo-parser-generator/blob/main/src/grammars/CongoCC.ccc#L697

                      I decided to maintain BNFProduction.java separately by hand, but actually, it might be a fairly marginal decision. (See: https://github.com/congo-cc/congo-parser-generator/blob/main/src/java/org/congocc/core/BNFProduction.java ) The main reason to do that boils down to the fact that we don't have Java tooling inside the grammar file, so if the injected code in a Node subclass gets a bit hairy, we prefer to have auto-completion and the rest of it when editing the code, which we don't have when editing the code embedded in the grammar file. But the thing is that probably most of the times you inject a method into a Node, it's really pretty trivial, along the lines of getXXX/setXXX sorts of things, and there is not much need for any tooling. So the advantage of having the injected code very near to the grammar production is the bigger consideration. Of course, if there was decent tooling for editing CongoCC grammar files, then...

                      Well, anyway, that was a point that was on the tip of my tongue to point out, that if you have a bunch of hand-coded Node classes, you can take this approach. And then those hand-maintained source files can be in any package you want!

                      All that said, you would probably be better off gradually moving towards the more approved code pattern.

                      But anyway, the bottom line on hand-edited/maintained Node subclasses is that you can do that by fully specifying the package name in the grammar and then they can be in whatever package you want to put them in. But if you use the preferred coding pattern, then the files are generated in the NODE_PACKAGE package. So I hope that's clear now...

                        revusky Got it, thanks. That is clear now.

                        I think that I will keep the source code organization that I currently have.

                        I have between 150 and 200 node classes, with several tens of thousands of SLOC in them. I don't think it would be practical to merge this all together into a single grammar file. And if it was instead split into multiple grammar files that were included together, well, how is that really different from maintaining the node files separately?

                          opeongo
                          Well, I can't really make too much of a comment on what you're doing since I haven't seen your code.
                          I would just make the point that the two approaches (I mean hand-maintaining a given Node subclass and having the Node be generated) are not an all-or-none choice anyway.

                          You can hand-maintain certain Node subclasses and have other ones be generated.

                          Another intermediate approach that it occurs to me to mention is that you can have your own intermediate Node class that you maintain by hand, like:

                                public class MyBaseNode extends BaseNode {...}

                          and if you want a generated class to extend that MyBaseNode class, you can specify that in an injection, as in:

                               INJECT SomeNode : extends somepackage.MyBaseNode {
                                       (maybe some injected code here)
                               }

                          So you have a lot of degrees of freedom to structure code as you wish. I would say that, generally speaking, if your concrete XXXNode class only has one or two very short methods in it, it would be preferable to have the source file be generated and just inject the methods. In particular, if you have hundreds of these little node classes that have to be hand-coded and committed to the code repository, it would be better if they are just generated. I believe that if you give it a try, you would see the advantages of it. And you can always hand-maintain some of the files. It's not an all or none decision.
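The intermediate-base-class idea can be sketched in plain Java as follows. This is only an illustration under assumptions: BaseNode here is a simplified stand-in for the generated class, countDescendants() is a made-up helper, and SomeNode and OtherNode are written by hand to show roughly the shape that generated classes extending MyBaseNode would have.

```java
import java.util.ArrayList;
import java.util.List;

final class IntermediateBaseSketch {

    // Stand-in for the generated BaseNode; the real one has a richer API.
    static class BaseNode {
        private final List<BaseNode> children = new ArrayList<>();
        void add(BaseNode child) { children.add(child); }
        List<BaseNode> children() { return children; }
    }

    // Hand-maintained: shared, non-trivial logic lives here, edited in the IDE.
    static class MyBaseNode extends BaseNode {
        // Hypothetical helper: counts every node below this one.
        int countDescendants() {
            int total = 0;
            for (BaseNode child : children()) {
                total++;
                if (child instanceof MyBaseNode) {
                    total += ((MyBaseNode) child).countDescendants();
                }
            }
            return total;
        }
    }

    // Roughly the shape of generated classes whose injection says
    // "extends somepackage.MyBaseNode"; they stay tiny.
    static class SomeNode extends MyBaseNode {}
    static class OtherNode extends MyBaseNode {}

    public static void main(String[] args) {
        SomeNode root = new SomeNode();
        OtherNode inner = new OtherNode();
        inner.add(new SomeNode());
        root.add(inner);
        root.add(new OtherNode());
        System.out.println(root.countDescendants());
    }
}
```

The point of the split is that the small concrete classes can be regenerated freely, while the one class that carries real logic is kept under version control and edited with full tooling.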
