Making CongoCC's ASTs more useful

vsajip

The AST currently produced by CongoCC is actually a CST, capturing everything including punctuation tokens. This can be very useful for some applications, and in particular I use it to verify that support for lexer/parser generation in Python and C# provides the exact same lexing and parsing results as the canonical Java implementation.

However, the CST is not especially useful for certain applications. One example of this lack of utility comes from the transpiling approach currently used to provide the support for Python and C# generation. In order to transpile semantic actions written in Java into the generated Python and C# code, the AST for those actions needs to be used as a source. However, because that source AST contains lots of extraneous items like punctuation tokens and optional items, the AST needs to be transformed into a new AST, one which is more amenable as a source for the transpiler. Let's look at specific examples: the portions of the Java grammar relating to method declarations and constructor declarations (from which I have elided the semantic actions, for brevity):

#MethodDeclaration :
  (
    SCAN \.\.\InterfaceDeclaration
    =>
    { /* elided */ }
    |
    SCAN ~\...\TypeDeclaration
    => 
    { /* elided */ }
    |
    { /* elided */ }
  )
  Modifiers
  [ TypeParameters ]
  ReturnType
  <IDENTIFIER> 
  =>|+1 FormalParameters ( (Annotation)* "[" "]" )*
  [ ThrowsList ]
  ( Block | ";" )
  { /* elided */ }
;

and

ConstructorDeclaration :
  { /* elided */ }#
  Modifiers
  [ TypeParameters ]
  TypeIdentifier FormalParameters =>||
  [ ThrowsList ]
  "{"
  [ => ExplicitConstructorInvocation ]
  ( BlockStatement )*!
  "}"
;

Note that despite constructors conceptually being very similar to methods, the grammars are quite different. This means that it is hard to pinpoint common areas of interest in the AST. Even for different methods, i.e. different instances of MethodDeclaration, the ASTs are different in terms of e.g. where the return type or formal parameters are located in the node representing the declaration. Let's look at a couple of examples. This method:

void addSymbols(Set<String> symbols) {
  ppSymbols.addAll(symbols);
}

gives this AST:
[image: AST for the addSymbols method]
whereas this method:

private Token TOKEN_HOOK(Token tok) {
  /* contents elided */
}

gives this AST:
[image: AST for the TOKEN_HOOK method]

As you can see, the positions of the return type, method name, formal parameters and statements are shifted by one between the two cases.

For a constructor declaration, the AST looks very different. These constructors:

INJECT PARSER_CLASS :
{
   public PARSER_CLASS(Path path, java.util.Set<String> ppSymbols) throws IOException {
      this(path);
      if (ppSymbols != null) token_source.addSymbols(ppSymbols);
      System.out.println("foo");
   }
   
   PARSER_CLASS(String pathString) {
   }
}

give these ASTs:
[image: AST for the first constructor]
and
[image: AST for the second constructor]

Note that again, the positions of child nodes of interest are shifted, and when compared to the method declarations, the statements are all inline rather than being under a CodeBlock child node.

This situation makes it harder than necessary to access child nodes of interest when processing a particular node. Although the node machinery does provide mechanisms to fetch children of a particular type, this is of limited help - there might well be multiple children of the same type, and some of them might be optional in the grammar. And as the above example shows, it's not possible to access the statements in a MethodDeclaration and a ConstructorDeclaration in a uniform way.

All of this becomes even more awkward when a grammar is under development: the code using the AST has to keep changing, because changes in the grammar move child nodes around, both in absolute position and relative to each other.

What's needed is an unambiguous way of fetching a particular child, which is most straightforwardly done by accessing a child by name. Thus, node.getNamedChild("formals") (for example) might fetch the formal parameters node of either a MethodDeclaration or a ConstructorDeclaration in a uniform way, as a Node object. This could be achieved by changing the CongoCC grammar to allow the grammar writer to indicate the names to be used. For example, the notation FormalParameters (* formals *) might be used to indicate that the FormalParameters child will be accessible using the "formals" name. You could also use a notation such as ( BlockStatement (* [statements] *) )* to indicate that node.getNamedChildList("statements") would return a List<BlockStatement> of the statements in a constructor. For optional items which aren't present, getNamedChild could return null, and getNamedChildList() could return an empty list.
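To make the proposal a bit more concrete, here is a minimal sketch of what the node side might look like. Everything in it is hypothetical: NamedNode, addNamed and addToList are illustration-only names, not part of the Node API that CongoCC generates, and the list is left untyped rather than being a List<BlockStatement>. The tree shapes in the demo are simplified stand-ins for the ASTs shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NamedNode and its methods are NOT part of the
// Node API that CongoCC currently generates.
class NamedNode {
    private final String type;
    private final List<NamedNode> children = new ArrayList<>();
    private final Map<String, NamedNode> namedChildren = new HashMap<>();
    private final Map<String, List<NamedNode>> namedLists = new HashMap<>();

    NamedNode(String type) { this.type = type; }

    String getType() { return type; }

    // Positional access is left untouched, so existing code keeps working.
    NamedNode getChild(int index) { return children.get(index); }

    // Plain child with no name, e.g. a punctuation token.
    NamedNode add(NamedNode child) {
        children.add(child);
        return this;
    }

    // What the parser might do for: FormalParameters (* formals *)
    NamedNode addNamed(String name, NamedNode child) {
        children.add(child);
        namedChildren.put(name, child);
        return this;
    }

    // What the parser might do for: ( BlockStatement (* [statements] *) )*
    NamedNode addToList(String name, NamedNode child) {
        children.add(child);
        namedLists.computeIfAbsent(name, k -> new ArrayList<>()).add(child);
        return this;
    }

    // Returns null when the named (optional) child is absent.
    NamedNode getNamedChild(String name) { return namedChildren.get(name); }

    // Returns an empty list when nothing was recorded under the name.
    List<NamedNode> getNamedChildList(String name) {
        List<NamedNode> list = namedLists.get(name);
        return list == null ? Collections.emptyList() : Collections.unmodifiableList(list);
    }
}

class NamedChildDemo {
    // Simplified constructor shape: statements are inline between the braces.
    static NamedNode constructor() {
        return new NamedNode("ConstructorDeclaration")
            .add(new NamedNode("TypeIdentifier"))
            .addNamed("formals", new NamedNode("FormalParameters"))
            .add(new NamedNode("{"))
            .addToList("statements", new NamedNode("BlockStatement"))
            .add(new NamedNode("}"));
    }

    // Simplified method shape: a leading Modifiers child shifts every position.
    static NamedNode method() {
        return new NamedNode("MethodDeclaration")
            .add(new NamedNode("Modifiers"))
            .add(new NamedNode("ReturnType"))
            .add(new NamedNode("Identifier"))
            .addNamed("formals", new NamedNode("FormalParameters"));
    }

    public static void main(String[] args) {
        for (NamedNode decl : Arrays.asList(constructor(), method())) {
            System.out.println(decl.getType()
                + ": formals=" + decl.getNamedChild("formals").getType()
                + ", statements=" + decl.getNamedChildList("statements").size());
        }
    }
}
```

The point of the sketch is that the caller asks for "formals" in the same way for both declarations, even though the FormalParameters child sits at index 1 in one tree and index 3 in the other.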

Of course I'm just using the (* *) notation as a placeholder, but what do you think of this idea?

vsajip

ngx the AST must be as faithful to the bnf as possible

Sure, maybe for all of your applications, but not for all of mine. The completely faithful AST is then more of a concrete syntax tree than an abstract one. What I proposed was leaving the existing access mechanisms as they are (there will be applications that need to know, for example, the position of every punctuation token - and in any case, none of your code would need to change because of my suggestion), but augmenting that with an approach that makes the AST more useful in certain scenarios.

vsajip What's needed is an unambiguous way of fetching a particular child, which is most straightforwardly done by accessing a child by name.

I gave specific examples where this augmentation would be useful - the transpiler code which is currently in CongoCC would be shorter, simpler and easier to reason about if one could access child nodes by name.

revusky

Well, you can use INJECT to expose methods that you need.

For example, the JavaCC.javacc grammar INCLUDEs the Java grammar and it injects a couple of methods into MethodDeclaration that it uses here and there. I mean:

  https://github.com/javacc21/javacc21/blob/master/src/javacc/JavaCC.javacc#L717

There is no particular reason for you not to feel free to inject any methods that you need for your own internal use.

And even having MethodDeclaration and ConstructorDeclaration inherit from a common abstract base class (or interface) so that certain methods only need to be defined once, that's not too hard to do -- I mean declaratively right in the grammar using INJECT.
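In plain Java terms, the end result of that kind of injection might look roughly like the sketch below. It does not show the INJECT declarations themselves, and every name in it (BaseNode, CallableDecl, MethodDecl, CtorDecl) is a hypothetical stand-in rather than a class from the actual Java grammar. Using a single "BlockStatement" type for all statements is also a simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal stand-in for a generated node class; hypothetical, not CongoCC's Node.
class BaseNode {
    final String type;
    final List<BaseNode> children = new ArrayList<>();

    BaseNode(String type) { this.type = type; }

    BaseNode add(BaseNode child) {
        children.add(child);
        return this;
    }

    BaseNode firstChildOfType(String wanted) {
        for (BaseNode child : children) {
            if (child.type.equals(wanted)) return child;
        }
        return null;
    }

    List<BaseNode> childrenOfType(String wanted) {
        List<BaseNode> result = new ArrayList<>();
        for (BaseNode child : children) {
            if (child.type.equals(wanted)) result.add(child);
        }
        return result;
    }
}

// Hypothetical shared view: callers need not know which kind of declaration they hold.
interface CallableDecl {
    BaseNode getFormals();
    List<BaseNode> getStatements();
}

class MethodDecl extends BaseNode implements CallableDecl {
    MethodDecl() { super("MethodDeclaration"); }

    public BaseNode getFormals() { return firstChildOfType("FormalParameters"); }

    // Statements live under a CodeBlock child; a method ending in ";" has none.
    public List<BaseNode> getStatements() {
        BaseNode block = firstChildOfType("CodeBlock");
        return block == null ? Collections.emptyList() : block.childrenOfType("BlockStatement");
    }
}

class CtorDecl extends BaseNode implements CallableDecl {
    CtorDecl() { super("ConstructorDeclaration"); }

    public BaseNode getFormals() { return firstChildOfType("FormalParameters"); }

    // Statements are direct children, sitting between the "{" and "}" tokens.
    public List<BaseNode> getStatements() { return childrenOfType("BlockStatement"); }
}
```

Each class hides where its own statements live, so code that holds a CallableDecl can call getStatements() without caring about the CodeBlock difference.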

Anyway, as regards the recent (or not so recent, maybe a couple of months ago at this point) shifting around of the Java grammar, I think that mostly happened because at the time I was working on the C# grammar, and since there is a lot of overlap, I saw certain ways of implementing certain constructs in C# and then got to thinking whether it would be better if the Java grammar did things that way as well. So I tweaked things in the Java grammar.

But really, generally speaking, regarding your question (in private) about how stable the AST should be moving forward, I think it's probably going to be pretty stable. And actually, I think I would be perfectly happy at this point if you basically "took ownership" of the 3 main grammars we have (Java, Python, C#) and I just won't even touch them basically! (It might take a bit of self-control because I have had this tendency to obsessively tinker with them, but that may be gone anyway. There are so many other pressing issues to address that I shouldn't be tinkering with these grammars at all.) I think I was tinkering with the Java grammar as a kind of by-product of working up the C# grammar where I saw (or thought I saw) slightly more elegant ways of doing whatever and then did them that way in the Java grammar as well. But that shouldn't be happening now.

(Sorry to be so slow to answer this. A bunch of things came up and actually, I went on a trip with my daughter and I'm writing this note (and the other responses on SQL today) from an AirBNB apartment in Granada in the South of Spain!)

vsajip

revusky Well, you can use INJECT to expose methods that you need.

I know that, but it doesn't help much in terms of the original problem I described. Adding new methods to classes goes only so far. I'm also thinking of my suggested changes to the grammar as something that would help users of CongoCC in general, not just my current situation of working with the three example grammars.

revusky

vsajip

Well, let's see.... I'm a bit confused about what you're saying. Well, to be clear about one thing, you refer to the "three example grammars". It's true that they are the most meaty real-world examples of using the overall system. However, in particular, the Java grammar is used internally, so it's kind of more than just an example! And, as I see it, as things move forward (they will eventually even though things have been slow for the last couple of months) the other two grammars, Python and C# will also be used internally in a similar way. (At least that is how I anticipate things moving.)

Now, okay, thinking out loud... it's true that there is kind of a contradiction here maybe, which is that, say, we give the Java grammar as an "example" that somebody can freely use to build on, and, in principle, we don't know what that person is doing. So I took the approach of a very bloody-minded sort of AST building, where we don't throw away any information at all. So, okay, depending on what you want to do, the resulting tree could be cumbersome to work with.

So I guess this leads to the question of a separate tree transformation pass where you convert the AST into something more like what you want to work with. Now, one thing I have been meaning to ask you, Vinay, is whether you had ever considered using the Node.Visitor API. (I know you aren't using it anywhere, but did you look at it and consider using it and then decide against it?)

So, I mean, maybe one solution is just to write a Node.Visitor subclass that builds the tree you want to work with as a result of visiting all the nodes. I mean, generally speaking, if, in a given spot, you only need a subset of the information in the full tree, maybe you could just have a visitor object that runs over the full tree (or the subtree that you're interested in at that point) and just builds another tree that is more amenable to what you want to do. (Am I expressing myself clearly?)
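Roughly what I have in mind, as a sketch only. To keep it self-contained it uses a plain recursive walk in place of the real Node.Visitor machinery, and all the class names here (CstNode, SlimNode, SlimTreeBuilder) are made up for illustration. The only "streamlining" it does is to drop punctuation tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical full-fidelity node: tokens carry their text, other nodes carry children.
class CstNode {
    final String type;
    final String text; // null for non-token nodes
    final List<CstNode> children = new ArrayList<>();

    CstNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    static CstNode token(String text) { return new CstNode("Token", text); }

    static CstNode node(String type, CstNode... kids) {
        CstNode n = new CstNode(type, null);
        n.children.addAll(Arrays.asList(kids));
        return n;
    }
}

// Hypothetical streamlined node produced by the walk.
class SlimNode {
    final String label;
    final List<SlimNode> children = new ArrayList<>();

    SlimNode(String label) { this.label = label; }
}

class SlimTreeBuilder {
    private static final Set<String> PUNCTUATION =
        new HashSet<>(Arrays.asList("(", ")", "{", "}", "[", "]", ";", ","));

    // Visit each node once, copying everything except punctuation tokens.
    // The original tree is left unmodified.
    SlimNode build(CstNode node) {
        SlimNode slim = new SlimNode(node.text != null ? node.text : node.type);
        for (CstNode child : node.children) {
            if (child.text != null && PUNCTUATION.contains(child.text)) {
                continue; // drop delimiters the consumer does not need
            }
            slim.children.add(build(child));
        }
        return slim;
    }
}
```

The full tree stays around for anything that needs every token, and the slimmed copy is what the downstream code would actually work with.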

But I do roughly that in this NfaBuilder class. I mean here: https://github.com/javacc21/javacc21/blob/master/src/java/com/javacc/core/NfaBuilder.java

So, the subtree we're interested in at that point is the one that is a RegularExpression object and by just walking the tree, we build these various NfaState start/end pairs, which are a kind of tree themselves, but you could say specialized for this Nfa stuff.

(I can't help but toot my own horn a bit that... my god, if you compare this code to what it was originally, it's just....)

So, anyway, I throw that idea out there.... Have to run. Maybe will have some other thoughts later. And again, sorry for being so slow to respond. I want to get back to a more normal situation where I'm much more responsive -- you know, like the good old days with FreeMarker, at least during the periods of time where I was really into it!

ngx

I don't want to interfere in the discussion but IMHO, the AST must be as faithful to the bnf as possible.

I personally use various approaches to access nodes:

fixed position when its fixed, e.g.

          if (n.childs[2].id == JJT_OBJECT_WRITE) {
            tableObject = (AST_object) n.childs[2];
          } else {
            tableObject = (AST_object) n.childs[3];
          }

using one of my numerous node finding methods, used essentially with repetitive constructs, e.g.

  public final List<SimpleNode> childrenOfType(int type) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, -1, null);
    return v;
  }

  public final List<SimpleNode> childrenOfType(int type, int avoiding) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, avoiding, null);
    return v;
  }

  public final List<SimpleNode> childrenOfType(int type, int[] avoiding) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, avoiding, null);
    return v;
  }

  public final List<SimpleNode> childrenOfType(int type, int[] avoiding, MatchPredicate mp) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, avoiding, mp);
    return v;
  }
  // same as above but with arrays for types
  public final List<SimpleNode> childrenOfType(int... type) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, -1);
    return v;
  }

  // same as above but with array of avoiding
  public final List<SimpleNode> childrenOfType(int[] type, int[] avoiding) {
    ArrayList<SimpleNode> v = new ArrayList<>();
    _childrenOfType(v, type, avoiding);
    return v;
  }

create specific fields when options are present and efficiency is involved, e.g.

AST_select _select() :
{}
{
  (
    _query_expression()
    [ jjtThis.orderByClause=_order_by_clause() ]
    [ jjtThis.computeClause=_compute_clause() ]
    [ LOOKAHEAD(2) jjtThis.forClause=_for_clause() ]
    [ jjtThis.isolationClause=_isolation_clause() ]
    [ _abstract_plan_clause() ]
    [ _FOR() _IDSTR()/*XML*/ ]
    [ _option_query_hint_ms() ]
  )
  {
    jjtThis.id = (short)selectNodeType;
    return jjtThis;
  }
}

flattening the tree when needed, e.g.

  public final SimpleNode[] preorderArraySkipSubprocs() {
    ArrayList<SimpleNode> a = new ArrayList<>(); // optimize with growth size
    preorderArrayInternalSkipSubprocs(a);
    SimpleNode[] result = a.toArray(new SimpleNode[0]);
    return result;
  }

using visitor to traverse everything and transcribe to something else (here HTML), e.g.

public final class HTMLPrintVisitor implements ExplVisitor {
...
@Override
  public Object visit(AST_program node, Object data) {
    // Dec 2000 add the first line number
    HTMLPVData d = (HTMLPVData) data;
    d.addLineNumber();

    node.jjtAcceptChildren(this, data);

    // must add the last token (EOF) because of trailing comments!
    d.addSpecialTokensOf(node, node.lastToken, d.html);

    // must close properly
    if (d.html) {
      // 3 MAR 2001: No, works without that, we APPEND in batch analysis
      d.addTokenToList(FMT_HTML_SLASH_PRE);
    }
    return data;
  }
...
@Override
  public Object visit(AST_col_def node, Object data) {
    String bold = "", stopBold = "";
    HTMLPVData d = (HTMLPVData) data;
    if (d.html) { // the first child is the column name
      if (d.table != null) {
        ColumnOfTable c = d.table.getColumn(node.childs[0].firstToken.image);
        if (c != null) {
          if (c.participatesInPK() && c.participatesInFK()) {
            bold = "<b><u>";
            stopBold = "</u></b>";
          } else if (c.participatesInPK()) {
            bold = "<b>";
            stopBold = "</b>";
          } else if (c.participatesInFK()) {
            bold = "<u>";
            stopBold = "</u>";
          }
        }
      }
      d.pendingToken = new Token(Token.FMT, "<a name=" + node.childs[0].firstToken.image + '>' + bold);
    }
    node.jjtAcceptChildren(this, data);
    if (d.html) {
      if (stopBold != null) {
        d.addTokenToList(new Token(Token.FMT, stopBold));
      }
      d.addTokenToList(FMT_CLOSE_ANCHOR);
    }
    return data;
  }

...because you may not be interested in some productions on day one, but maybe one day these nodes will be useful(?)
(It is true though that smart node creation trades efficiency against ease of node access)

revusky

ngx I don't want to interfere in the discussion but IMHO, the AST must be as faithful to the bnf as possible.

Oh, not to worry. By all means, interfere in the discussion! (After all, if we were intent on having a private conversation, we have the means to do that as well, so if it's here...)

I think the thing is that if you're going to have a Java grammar and tell people: "Here, you can use this in your own projects" then, yeah, it really has to just generate the most bloody-minded AST with ALL the information. And it does. You have every last superfluous delimiter and so on. This is simply because, since we can't anticipate all the ways this would be used, we don't throw away any information a user might conceivably need.

But then it might frequently be unwieldy to use, but I don't really know for sure whether the more common solution is to work with that AST and ignore all the extraneous information, or just with a Visitor, traverse the tree and generate something more streamlined for what you are doing and then work with that.

Oh, by the way, I notice that the example code you provide uses legacy syntax. Also, when you write: "using one of my numerous node finding methods", I have to think that you are aware that there are more modern versions of that sort of thing already by default in the generated Node.java. The thing is that, because of all the refactoring in JavaCC 21 (I should really start saying CongoCC all the time) the ones I put in are much more modern, idiomatic Java, so where you have something like:

   if (n.childs[2].id == JJT_OBJECT_WRITE) {

the newer more approved way would be more like:

    if (n.getChild(2) instanceof ObjectWrite) {

I guess in the original JJTree generated API, the generated ASTXXX Node objects have an id field that is a static final int (not a type-safe Enum!) that represents the node's "type". I don't know whether this kind of thing is a result of slavishly copying the way older C-based tools like YACC or Bison worked. I mean, once you're in an OOP language like Java, it just makes more sense to just use the language's type system, no? (Even back then using a much older version of Java.)

Well, I'd have to think you're using the newer API in new code, no? And also, with generics, the code is more type-safe, like where you have ArrayList<SimpleNode>, you can use the specialized node subclass instead, like:

  List<MethodDeclaration> methods = node.childrenOfType(MethodDeclaration.class);

And you can write things that are much more elegant, maybe like:

  for (MethodDeclaration md : compilationUnit.descendants(MethodDeclaration.class, md->md.isPublic())) {
       ...
  }

(The descendants method can have a second parameter that is a Predicate, which can be written using a lambda, so...)

So the above snippet would recursively iterate over all the MethodDeclaration objects in the source file that are public. (And look, Ma! No casting!)
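For anyone curious how a method of that shape can be put together with generics and a Predicate, here is a minimal self-contained sketch. It is not the code in the generated Node.java. The classes TreeNode, ClassBodyNode and MethodNode are hypothetical, and the pre-order traversal is just one reasonable choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical node classes; the generated Node API is not reproduced here.
class TreeNode {
    private final List<TreeNode> children = new ArrayList<>();

    TreeNode add(TreeNode child) {
        children.add(child);
        return this;
    }

    // Pre-order walk collecting descendants of the given class that pass the predicate.
    <T extends TreeNode> List<T> descendants(Class<T> type, Predicate<? super T> predicate) {
        List<T> result = new ArrayList<>();
        for (TreeNode child : children) {
            if (type.isInstance(child)) {
                T typed = type.cast(child); // checked cast, done once inside the API
                if (predicate.test(typed)) result.add(typed);
            }
            result.addAll(child.descendants(type, predicate));
        }
        return result;
    }
}

class ClassBodyNode extends TreeNode {}

class MethodNode extends TreeNode {
    private final String name;
    private final boolean isPublic;

    MethodNode(String name, boolean isPublic) {
        this.name = name;
        this.isPublic = isPublic;
    }

    String getName() { return name; }

    boolean isPublic() { return isPublic; }
}
```

With these classes, root.descendants(MethodNode.class, m -> m.isPublic()) hands back a List<MethodNode>, so the calling code needs neither a cast nor an instanceof check of its own.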

Well, maybe you think I point out these things just to show off a bit. (And you'd be right...) But it's not such a big deal either. I got satisfaction out of doing that, but it's also kind of obvious, really, I think. I mean, once you see that all these features were added to the Java language, like generics and lambdas, it just becomes kind of obvious to use them this way in a generic Node traversal sort of API, don't you think? So, it's really kind of amazing when you look at the legacy JavaCC, just how stuck in a kind of time warp that whole thing is. Just about no use of any language feature that came after Java 1!

So that old thing is there and you have had people for years pretending that it's an active project, but really, properly understood, the situation is laughable. It's a kind of museum piece really. But then you approach people who are still using that and tell them that there is a much more modernized version of the tool and far more often than not, you can't even get them to look at it!

Well, maybe with a sexy new name like Congo... But, really, my earlier belief that you could largely get the older user base to switch over, that just turned out to be false. A few people, like you, but overall, that just doesn't work very much. There's a need to market this to new users, probably people who never used a parser generator before. I think that's more where the future of this project lies. Oh well, 'nuff said for now....

ngx

Sure, what I am showing is extracts of my 4132 lines of former JJT syntax. The full story is that I wrote this product, sqlbrowser, which I always believed could help my retirement wages! It actually started in 1993, in C on SunOS using 'curses', if this means anything to you. I then discovered Sriram's & co tool, which I found cool compared to yacc (I had the Gang of 3 dragon book on my shelf at the time). The issue is that the current product works for Sybase DB only, which is pretty much ditched, and my endeavor today is to generalize it to other databases.

The thing is that these database vendors revolve more-or-less around SQL ANSI standards while adding a huge amount of custom syntax, which is pretty inconvenient to parse, or should I say, to skip! I am only interested in the core syntax so-to-say. Hence the need for skip-over techniques like this negative lookahead you showed me.

On the full AST thing, the reason I say it should be complete is for cases where you need 'reproduction', for instance sp_addthreshold.html.
Here, the website exposes a Sybase so-called stored procedure, i.e. SQL business logic going beyond Select/Update/Insert/Delete. It is colorized and re-indented properly through the HTML visitor for that purpose.
As we can see, as you say also, we need every bit and piece of the original content.

So yes this code has been written with the original JavaCC for the last 20 years!
BTW using the token type integer field is for efficiency. Plus I had a guru OO guy telling me that if you use instanceof, something is wrong in your design. This was really 30 years ago :-)
I had to write all these custom node inspectors, especially for looping constructs, to get collections of subnodes, and these may very well be replaced by built-in facilities. But indeed, to corroborate vsajip, fetching 'nodes of interest' is the first thing you want to do on your AST.

To potentially increase adoption, here are my 2 cts.

First I think that we need to split the population in two:

1. the ones interested in parsing (very few actually, probably students)
2. the ones willing to use a parser as a feature but not so versed in syntax matters (<== target population)

What I think could help increase adoption for population 2:

out-of-the-box-ness
Getting a proper Maven artifact: nowadays, people just want to include 3 lines in their pom and start playing. I am certain that many people out there won't make the effort to use a library that presents itself as source code. In large corporations, for instance, they even only promote properly threat-scanned libraries as cyber issues gain importance. And don't even try to explain that a parser is not concerned by this. They just want clean scanned binaries.
versioning
I know that CCC is a WIP but I have the feeling that not releasing versions gives the (false) impression that the product 'is not finished'. Again, even if this point is not relevant for population 1, general adopters need to use versioned bricks for their released software. I say this as I come from a large corporation where, in such a case, one would have to manually create one's own binary artifact to freeze a version.
diagnostics
You said that you think people should not need DEBUG_PARSER... I think that you also removed the "choice conflict at XXX, consider a LOOKAHEAD" message. My take is that to understand grammars, lookahead, ambiguities etc. you need to really be into it, and the learning curve can be quite discouraging. I think people just give up when it's too difficult. That's why anything that can help people refine grammars without resorting to breakpoints would help adoption, I think. Actually this point is almost the same as the one above, i.e. consider CongoCC as a binary facility. Writing a grammar looks like a REPL activity for most people. And since the parser is generated code, the loop is inherently long. I personally re-generate/recompile all parser classes at each pass. I know that this is maybe not the proper academic think-before-you-code way, but, as I see my son doing with his Python programs, copying snippets without real understanding is the trend. So that's why DEBUG_XXX and "choice conflict" messages are probably a good way to keep people from getting discouraged too quickly.

Indeed, however, the lack of interest is strange to me...
There's a need to market this to new users, probably people who never used a parser generator before.
Absolutely

revusky

ngx BTW using the token type integer field is for efficiency. Plus I had a guru OO guy telling me that if you use instanceof, something is wrong in your design. This was really 30 years ago :-)

Well... the guy you're referring to may be a genuine guru, but I don't think I would join his ashram! LOL.

I mean, finally, the statement that you're doing something wrong if you use instanceof... to me, that comes across as very dogmatic and impractical. I certainly can't really buy the idea that instanceof is inherently evil the way some totally unstructured goto is.

And I don't really buy the performance angle. It may be that in very early versions of Java, casts and instanceof, a.k.a. RTTI, came with a lot of overhead. But my sense is that this has not been the case for a long time. (I say that's my sense of things because I never really looked into it any serious way, I have to admit.) Now, I'm quite happy that generics allows you to write code with fewer casts or instanceof checks, but to me, that's pretty much entirely a readability question. I think it's best just to write things in the most straightforward way you can, and if that really entails some big performance issues for some reason, then, sure, deal with it. But if not... that's surely what Knuth was getting at with the famous line about "premature optimization".

As for the difficulty of marketing the project, well... hmm.... a problem that is real that is maybe a bit delicate to mention is what I have referred to as the Norbert Sudhaus Fan Club, which is really about a certain kind of sheer cussedness on the part of certain people. I'm loath to dwell on this too much, though I have mentioned it at times. But it is very real. You know, when I picked up this work again, I figured that certainly most existing JavaCC users would be delighted that somebody like me comes along and actually picks this up after so many years of neglect and does something with it. I infer that that is your reaction, but it's far from universal.

I mean, there are these people who you could call "influencers" in the space, like the guy I referred to in the Norbert Sudhaus Fan club article. He (and he's not the only one) just makes a point of never acknowledging my work. And, you know, the legacy JavaCC is really kind of the archetypal Nothingburger project (the mother of all nothingburgers, to paraphrase Saddam Hussein!) and these kinds of people I'm referring to really do resent anybody who can do any meaningful work on something. It's not something that I'm just imagining! It's real!

But note that, I'm not even saying that this is the reason that most people who could benefit from this aren't using it. The typical reason that any given person is not using this updated version is simply that they don't know about it. But then, why don't they know about it? A big part of that is that a certain group of people have decided to cancel it. You know, cancel culture. (That was a term I was unaware of until a year ago or so and came across it in another context.)

Now, as for the other point about these various choice conflict warnings and such -- that having been removed -- I don't reject the possibility of putting back in some sanity checks, but I don't think the existing ones were all that useful really. I think even that some of the messages it emitted were rather wrong-headed. Also, by the way, the code that analyzed and generated those messages was so grotesque that I finally applied the Gordian knot solution to it.

But still, I would say that all of that is really a separate conversation, because I don't think that anybody is declining to use this improved version of the tool because those various warning messages were so important to them. And likewise with this DEBUG_PARSER and DEBUG_TOKEN_MANAGER stuff. I mean, again, the original JavaCC code was written around 1996 when maybe these guys didn't even have a decent debugger, so.... And, on the other hand, surely you've noticed that stack traces are much more useful than before and the code itself has far more information about the point in the grammar where it came from. So I am quite certain that it is much easier to debug a JavaCC21 generated parser than before. But, again, I can't really buy the notion that adoption has been slow because I removed (rightly or wrongly) certain things like that. The reasons lie elsewhere, I'm sure.

But what to do then? Just plough forward mostly. Also, I guess I expended a lot more energy in terms of keeping the older syntax working and things like that than I should have. My current plan is just to have the last version that is labeled JavaCC21 include a syntax converter and then all the later versions called CongoCC will only support the newer syntax. Surely you don't miss the older syntax, do you?

ngx

Don't get me wrong, I am a great supporter of your work, and I don't see any other parser generator competing. So anyone with a parsing problem must come to javacc21 without doubt. I am just saying that we are all in some comfort zone and (I see it with procrastination, for example) people are less inclined to make an effort. If you take the REPL approach in Python Jupyter notebooks, it reminds me of the early BASIC days! What I am trying to say is that the trial-and-error approach is frequent, as opposed to sitting down with a pencil, thinking and then coding. To be honest, it can be a way to understand retroactively, after the fact.

So in my case, when attempting to write pieces of a SQL tolerant grammar (with 'uninteresting' token sequence portions), which is definitely not LL(1) with furthermore mixed IDs/Keywords (because we try to parse multiple dialects - maybe I will use token activation at some point, if dialect is known...), I find myself with a lot of ambiguous choice points.

Are you planning to have an OLD syntax to NEW syntax converter? This would be a great start for old users to convert.

Also, by the same token, an EBNF to NEW converter would also be great, even if only partially correct.
Using purely lexical approaches to do this has limits.

Just for you to get an idea of the task ahead of me, take the example of the SELECT statement alone, as defined in 3 SQL dialects:

As you can see, it is horrible in many ways: in the description (no BNF), it's fuzzy, it's big, and maybe simply impossible to parse completely in one parser only. And I did not even consider adding Sybase, DB2, Microsoft... versions! Maybe this task is just impossible (?)

revusky

ngx Don't get me wrong, I am a great supporter of your work, and I don't see any other parser generator competing.

Well, not to worry, I mentally catalogued you from the start as being on my side!

You know, a funny thing is that I've had this sort of idea at times that I could start some internet meme -- not so much in my computer nerd activities, but in some political writing I indulged in. But I never managed to really get a meme going, like I thought this meme "Roger Rabbit narrative" that I came up with about 6 years ago had a chance, but it never really took off.

Funny thing about memes though, I just noticed that if you type the string "Norbert Sudhaus" into google, my blog piece on the Norbert Sudhaus Fan Club is the very first hit! And funny enough, that's on the basis of a single article on an obscure technical blog. If I really set as a goal to push the Norbert Sudhaus meme, it looks totally doable. (Probably I won't bother, though...)

By the way, it's okay if you think that I shouldn't have written that. There's a side to me that thinks I shouldn't have written that, but I did. It may well have been self-indulgent on my part, but it was just an expression of my frustration with certain people. Or a certain kind of attitude I was running into continually. But it's certainly not that I have any cause to retract anything I said there. I'm satisfied that it's all quite true. (Unfortunately.)

Not that the Norbert Sudhaus Fan Club phenomenon is the sole thing going on. It's also true that there just is a lot of inertia out there. I mean, a lot of the projects out there that use the legacy JavaCC are in a pretty sorry state themselves. And very often, the guy who used JavaCC to write whatever parser used in that project is long gone and none of the people left really understand how it works. And I'm pretty sure that is the main reason that, if I show up and say: "Hey, look, here's this new improved version of JavaCC. It's really hot shit, guys!" they don't take much notice. And, I think, BTW, that it is largely in that context, that they then ask you whether it is a 100% BWC replacement (which it is not) and the issue of whether it is on Maven Central comes up. And well, that is why I'm skeptical about the whole Maven Central thing. I mean, finally, my sense is that having it on Maven central makes little difference, because what these people really want is to just edit a single line in a pom.xml file and then be using the newer version. And once they realise that this is not possible, that it requires some tweaking of their code, then they're not interested. I mean, the guru who wrote their JavaCC grammar is long gone (presumably went off to found another ashram!) and they're terrified of touching any of that.

ngx Are you planning to have an OLD syntax to NEW syntax converter? This would be a great start for old users to convert.

Yeah, I've been planning to do that for the longest time and it's not even that difficult. It's just that I never buckled down and did it. My approximate plan is to have the converter working as part of JavaCC21, but once everything has shifted over to the Congo renaming, then there's just no support for the older syntax.

So then I can get rid of a lot of cruft in the core grammar. But it's not just that. I think we'll just change tack completely and hardly ever mention the legacy JavaCC, just mention it at the footnote level here and there that CongoCC does have its origins in a rewrite of that old JavaCC thing, it's not a secret or anything, but a new user won't give a **** about that anyway. Why should they? All the more so if it's a Python or C# guy!

ngx And I did not even consider adding Sybase, DB2, Microsoft... versions! Maybe this task is just impossible (?)

Well, it's challenging for sure, but the thing about it is that undertaking a very challenging task like that is bound to result in improvements to the tool. That C# grammar I wrote recently was really difficult and I certainly ended up nailing a couple of bugs in the tool in the process of writing it. I also became conscious of certain limitations, where I could see my way to expressing things more cleanly if the tool had certain features. So I'll probably be addressing that in the near future.

I mean, with the legacy JavaCC, the task probably is basically impossible. It's not just that the thing is missing very basic features, like INCLUDE and such, and it has this very fundamental problem of not being able to do nested syntactic lookahead, but there's also just this fundamental problem that the thing is dead development-wise. There is just no prospect of any needed feature ever being added. Whereas with this project, if you run into a problem with your SQL work and it's like, well, I need such and such, and it's not too hard to implement the feature, well... we'll do it! Or better, since the codebase is so much cleaner now, you could learn the internals and do it yourself. (I think you'd end up seeing that it's not as difficult as you might suppose...)

By the way, what are your goals with the SQL work? Are you just intending to open source it or do you have some other plan?

ngx

Hey!

Why there is little enthusiasm for JavaCC21 is a mystery. In my experience, common sense always wins (though it may take a long time). Indeed, the changes in the traditional JavaCC repo are largely cosmetic and do not address real issues. All the changes your version proposes are not only good, they are very good. (The exception, as you know, is the removal of warnings about ambiguous grammars when the parser generator knows about the ambiguous choice points; I still do not understand why it is problematic to report this as a trace message when the user sets the flag.) Also, I never understood the need for parsing if it is not to get an AST, otherwise it is called lexing, so your Tree builder enabled by default is evident. Also all the syntax simplification you made is excellent, the catch-up '!' operator is truly a must-have...

Now what is in people's minds is hard to predict... The whole earth is on the verge of apocalypse just because of some crazy mind somewhere... but I digress...

On the open-source question, it is not really the plan: I have been selling my product for around 20 years and, as the parser is its main constituent, giving it away would void the entire product itself. And I would like to keep this (modest) revenue source as I will retire soon! I am however open to sharing any parsing tricks I find if I finally go ahead with this project - I am just evaluating the work right now....

revusky

ngx Also, I never understood the need for parsing if it is not to get an AST, otherwise it is called lexing, so your Tree builder enabled by default is evident.

Yeah, I certainly think so. I think that building a tree is bound to be the most typical usage pattern for a tool like this. So it just doesn't make any sense to be pretending that tree building is some specialized usage that requires a separate specialized tool.

Well, one big aspect of all that is that we're operating in a world in which memory is so cheap, like less than a penny a megabyte. In the world where these dinosaur tools like YACC were developed, a meg of RAM was something more like a million dollars! So, okay, those guys would have balked at building a full tree by default, because by their standards, that was crazy expensive in terms of RAM. Well, I said all that in a blog post I wrote almost exactly 2 years ago, the gigabyte is the new megabyte. So you could say that a lot of the overhaul of the older code was based on that basic realisation. So, as you likely know, I eventually just got rid of all that kind of thing. JavaCC 21 now sucks in all the input from the start and works entirely in memory.

Well, RAM being so ridiculously cheap as to be hardly worth worrying about (usually) is one thing, but there's also this other sort of more general, I dunno.... psychosocial issue here. I mean, in these projects that I call nothingburgers (another attempt to get a meme going, I guess) what you typically have is this extreme, fawning reverence for the original work. So, anybody proposing any fundamental change, it's kind of like if you proposed to some traditional religious people to rewrite passages in the Bible, like: Hey, guys, wouldn't it sound a bit better like this? It's literally like sacrilege or something.

So, obviously, as a result of that, you end up with this situation where totally archaic things are kept going indefinitely -- things that probably made sense back when but no longer do, like acting as if 64K of RAM is really a big deal. Or just never revisiting bad design decisions that were made back when....

In terms of JavaCC, the original JavaCC did not have any tree building component, and then (I assume because of end-user demand) automatic tree building was sort of bolted on, with this JJTree thing, but that was not written by the original team that created the core JavaCC tool. (I got clarification on that from Sriram Sankar in private email.) So that also explains why Token does not implement Node. That is the most natural thing, isn't it? The idea that the individual Tokens are typically the terminal nodes in the AST. I infer that the reason that it isn't like that is because the original author of JJTree (guy named Rob Duncan IIRC) set himself the task of developing this tree-building tool without making any changes at all to the core JavaCC. In fact, that's why it's implemented as a preprocessor, not as additional functionality in the core tool.
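
To make the "Token implements Node" idea concrete, here is a minimal Java sketch. All the names in it (SketchNode, SketchToken, SketchNonTerminal) are hypothetical and chosen for illustration only; this is not the actual CongoCC or JJTree API, just one way the idea of tokens being the terminal nodes of the tree could look.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal node interface (an assumption, not CongoCC's real Node).
interface SketchNode {
    List<SketchNode> children();
    String source();
}

// A token is simply a terminal node: no children, and its source is its own text.
final class SketchToken implements SketchNode {
    private final String image;
    SketchToken(String image) { this.image = image; }
    public List<SketchNode> children() { return Collections.emptyList(); }
    public String source() { return image; }
}

// A non-terminal holds child nodes, which may be tokens or other non-terminals.
final class SketchNonTerminal implements SketchNode {
    private final String name;
    private final List<SketchNode> kids = new ArrayList<>();
    SketchNonTerminal(String name) { this.name = name; }
    String name() { return name; }
    SketchNonTerminal add(SketchNode child) { kids.add(child); return this; }
    public List<SketchNode> children() { return kids; }
    // The source of a non-terminal is the concatenation of its children's source.
    public String source() {
        StringBuilder sb = new StringBuilder();
        for (SketchNode k : kids) sb.append(k.source());
        return sb.toString();
    }
}

class TokenAsNodeDemo {
    public static void main(String[] args) {
        SketchNonTerminal call = new SketchNonTerminal("Call")
            .add(new SketchToken("f"))
            .add(new SketchToken("("))
            .add(new SketchToken("x"))
            .add(new SketchToken(")"));
        System.out.println(call.source()); // prints f(x)
    }
}
```

Because tokens and non-terminals share one interface in this sketch, tree-walking code does not need a special case for the leaves.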

And once you know that ancient history, you can also start understanding other bizarre things, like why does the core JavaCC tool have a means of injecting code into the parser, i.e. PARSER_BEGIN...PARSER_END and TOKEN_MGR_DECLS but there is no way of injecting code into any generated Node classes with JJTree. And then, in FreeCC (which was the original name of my "fork") all this was unified back in 2008 with an INJECT directive. And then I showed up to the JavaCC "community" and told them what I had done (not a proposal or anything, real working code) and they drove me out with pitchforks basically.

But the JJTree thing was created in its current state in early or mid 1997 and it suffers all these fundamental design problems, but nobody ever did anything. (And it's been 25 years!) And now since I'm going on about this whole thing (though I should stop maybe!) one perverse result of all this is a large percentage of JavaCC-based projects (I quantified it as approximately half at some point) eschew the use of JJTree (for various well founded reasons actually) and then they build their own tree in the code actions. Well, granted, that does have the advantage that they build the exact tree with the exact API they want. That's true, as far as it goes, yeah, but there's just basically the problem that this ends up being a huge duplication of work. I think it's much more appealing to offer a standard Node API with the various methods for filtering and traversing the subnodes just working out of the box. So, one could envisage going from one CongoCC-based project to another and you're up to speed fairly fast because the Node manipulation/traversal API (which, in practice, is what an application programmer mostly works with) is the same. I mean, finally, everybody defining their own Node API etc is kind of like the good old days (that were not necessarily so good really) where developers were constantly writing their own Hashtable code and things like this. (I gather you are also old enough to remember that. So are Vinay and myself!)
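
As a rough illustration of what "filtering and traversing the subnodes out of the box" could mean, here is a small Java sketch. The class TreeNode and the methods childrenOfType and descendantsOfType are assumptions made up for this example; they are not claimed to match the method names of CongoCC's real Node API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class for illustration; the real Node API may differ.
class TreeNode {
    private final List<TreeNode> children = new ArrayList<>();

    TreeNode add(TreeNode child) { children.add(child); return this; }
    List<TreeNode> children() { return children; }

    // Filter: direct children that are instances of the given type.
    <T extends TreeNode> List<T> childrenOfType(Class<T> type) {
        List<T> result = new ArrayList<>();
        for (TreeNode c : children) {
            if (type.isInstance(c)) result.add(type.cast(c));
        }
        return result;
    }

    // Traverse: all descendants of the given type, in pre-order.
    <T extends TreeNode> List<T> descendantsOfType(Class<T> type) {
        List<T> result = new ArrayList<>();
        for (TreeNode c : children) {
            if (type.isInstance(c)) result.add(type.cast(c));
            result.addAll(c.descendantsOfType(type));
        }
        return result;
    }
}

// Hypothetical node types an application might get from a grammar.
class IdentifierNode extends TreeNode {
    final String name;
    IdentifierNode(String name) { this.name = name; }
}
class CallNode extends TreeNode {}
class BlockNode extends TreeNode {}

class NodeTraversalDemo {
    public static void main(String[] args) {
        TreeNode block = new BlockNode()
            .add(new CallNode().add(new IdentifierNode("f")).add(new IdentifierNode("x")))
            .add(new IdentifierNode("y"));
        for (IdentifierNode id : block.descendantsOfType(IdentifierNode.class)) {
            System.out.println(id.name); // prints f, x, y on separate lines
        }
    }
}
```

The point of the sketch is only that these helpers live in one shared base class, so every project built on the same tool gets the same traversal vocabulary instead of reinventing it.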

But this whole hostility and resentment towards somebody who wants to do an overhaul of some crufty old thing like this JavaCC -- that's just.... Well, it is disheartening, but hey, as you say, it's nothing like these politicians blithely wandering into WW3.

hylkevds

I think a lack of interest is for a large part caused by people simply not finding the project.
I found it while searching for a solution to switching lexical states from the grammar, finding your blog post about it, trying it, being confused, and then realizing this was not about the JavaCC I was using, but a different project!

A clear, different name may have made me realise the difference sooner, but it may also have caused me to not find the project (given I'm currently using the legacy project). Google is also not helping much, since I originally did not find this project at all, but that may also be caused by the name being too similar to the legacy one.

Next problem is probably tutorials, especially for the new syntax and new users. They should not be exposed to the old JavaCC syntax at all if it can be helped! A "how to start writing your grammar" guide might help with that. Preferably with some example projects using maven, gradle and ant, that generate a parser as part of building, and then run some tests against the generated parser.
The (non) integration with build systems doesn't help here. Not just for users migrating from Legacy JavaCC, but also for new users. Having to put a jar in GIT, or using wget to fetch a jar as part of a build pipeline will turn away most potential users.

Getting existing Legacy-JavaCC users to convert is probably very hard to do, unless they really need some of the new features. Like you already mentioned, the original author of their grammar is probably long gone, and they probably don't want to rewrite it (if they are even capable).

vsajip

hylkevds The (non) integration with build systems doesn't help here. Not just for users migrating from Legacy JavaCC, but also for new users. Having to put a jar in GIT, or using wget to fetch a jar as part of a build pipeline will turn away most potential users.

Not sure what you mean:

The jar is only put into version control when hacking on CongoCC itself, and AFAIK there's no need to do that if you are just using it to process grammars. It's just a tool that needs to be on the path, like java and javac.

It's quite common to do the equivalent of "wget" to fetch things in build (and test) pipelines these days. Below is a real-world example of setting up IronPython on multiple platforms in a GitHub Actions CI/CD workflow for one of my projects:

    - name: Setup IronPython 2.7.11 (Ubuntu)
      if: ${{ matrix.os == 'ubuntu-20.04' }}
      run: |
        pwd
        echo ==================
        ls -l
        echo ==================
        wget -q https://github.com/IronLanguages/ironpython2/releases/download/ipy-2.7.11/ironpython_2.7.11.deb -O ironpython_2.7.11.deb
        sudo dpkg -i ironpython_2.7.11.deb
    - name: Setup IronPython 2.7.11 (macOS)
      if: ${{ matrix.os == 'macos-latest' }}
      run: |
        pwd
        echo ==================
        ls -l
        echo ==================
        wget -q https://github.com/IronLanguages/ironpython2/releases/download/ipy-2.7.11/IronPython-2.7.11.pkg -O IronPython-2.7.11.pkg
        sudo installer -pkg IronPython-2.7.11.pkg -target /
    - name: Setup IronPython 2.7.11 (Windows)
      if: ${{ matrix.os == 'windows-latest' }}
      run: |
        $source = 'https://github.com/IronLanguages/ironpython2/releases/download/ipy-2.7.11/IronPython.2.7.11.zip'
        $destination = "${env:USERPROFILE}/bin/IronPython.2.7.11.zip"
        Invoke-WebRequest -Uri $source -OutFile $destination
        Expand-Archive -Force $destination ${env:USERPROFILE}/bin/IronPython-2.7.11

What suggestion do you have instead of the wget or equivalent thereof, which you think would not turn away users?

adMartem

As a (hopefully, former) JavaCC user, I think I disagree with some of your points. Perhaps having to find out that CongoCC exists almost accidentally is an impediment, but starting to use it and coming up to reasonable speed did not seem to me to be a problem. It was easy-peasy to clone the git repo, and then just build it with the ant script. When you do, the examples become obvious. I admit searching the blog(s) for documentation is not ideal, but it is fruitful for the most part.

In my case, finding the existence of CongoCC (né JavaCC21) was kind of like happily finding an easter egg.

hylkevds

vsajip What suggestion do you have instead of the wget or equivalent thereof, which you think would not turn away users?

A maven/gradle plugin... 🙂
And versioned releases that are guaranteed to not change/vanish in the future.

Your GitHub Actions CI/CD workflow snippet is, IMHO, a good example of a bad idea. Now you have version numbers in multiple places, and the version used in the pipeline may be different from the version on your dev machine, or someone else's machine. This is also why maven/gradle wrappers are so popular: They guarantee the build environment is versioned, in the gradle case it can even version the JDK used for the build. And you now have to manage that code in the build script of every project that uses it...

Standardised version management (like in a pom.xml) also allows tools like Dependabot to check if newer versions of dependencies are available, and generate automatic pull requests, which are then automatically tested for regressions by the build pipeline, which uses things like Testcontainers to set up external components to test against... This minimises the effort involved in dependency management.

Repeatability of builds is very important when projects grow, and every tool/dependency that tries to do things its own way makes repeatability harder to achieve.

vsajip

hylkevds And versioned releases that are guaranteed to not change/vanish in the future.

I agree with this part - proper versioning is important. I'm not yet sure of the value of Maven/Gradle plugins for CongoCC, though. I think there are existing plugins that will download artifacts for you, and cache them if needed. My own experience of putting a new library on Maven Central was, I have to say, an exercise in frustration. It requires a whole lot of steps, which aren't terribly well documented by any official source - there are numerous third-party guides, like this one, but I have to say there was some element of trial-and-error when following these guides and I had to consult several of them before I had any success.

Apart from Gradle's poor performance, I ran into random issues, which don't seem to crop up with other tools - here's just one Gradle example I personally ran into.

hylkevds Your GitHub Actions CI/CD workflow snippet is, IMHO, a good example of a bad idea. Now you have version numbers in multiple places

I come from a Python background, where practicality beats purity. The snippet I posted doesn't exemplify the approach I'd take everywhere - I'm quite aware of DRY. If GitHub Actions didn't use YAML and instead had some format where you could easily specify variables, I'd use a variable for the version. I gave the snippet as an example of where you just have to use wget-type functionality in a build pipeline - the world is bigger than just Java/Maven/Gradle, and those other systems are prevalent in professional contexts just as Java is, and their communities also value proper versioning, reproducibility etc. But sometimes, you just have to work with what's available.

hylkevds Repeatability of builds is very important when projects grow

I agree with that too.

vsajip

Regarding the initial post: the "named children" feature has now been implemented and is being used to improve the polyglot (Python/C# parser generation) support code.
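
For readers wondering why named children help with the problem in the initial post, here is a minimal Java sketch. Everything in it (the NamedNode class, the addNamed and named methods, and the child name "parameters") is hypothetical and invented for illustration; it does not show how the feature is actually implemented or spelled in CongoCC. It only shows the general idea: a method declaration and a constructor declaration keep their formal parameters at different positions, but a consumer such as a transpiler can fetch them by the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node with both positional and named children (illustration only).
class NamedNode {
    private final String type;
    private final List<NamedNode> children = new ArrayList<>();
    private final Map<String, NamedNode> named = new LinkedHashMap<>();

    NamedNode(String type) { this.type = type; }
    String type() { return type; }

    // Add an ordinary positional child.
    NamedNode add(NamedNode child) { children.add(child); return this; }

    // Add a child that is also reachable by name.
    NamedNode addNamed(String name, NamedNode child) {
        children.add(child);
        named.put(name, child);
        return this;
    }

    // Look up a child by name; returns null if there is no such child.
    NamedNode named(String name) { return named.get(name); }

    int indexOf(NamedNode child) { return children.indexOf(child); }
}

class NamedChildrenDemo {
    public static void main(String[] args) {
        NamedNode method = new NamedNode("MethodDeclaration")
            .add(new NamedNode("Modifiers"))
            .add(new NamedNode("ReturnType"))
            .add(new NamedNode("Identifier"))
            .addNamed("parameters", new NamedNode("FormalParameters"));
        NamedNode ctor = new NamedNode("ConstructorDeclaration")
            .add(new NamedNode("Modifiers"))
            .add(new NamedNode("TypeIdentifier"))
            .addNamed("parameters", new NamedNode("FormalParameters"));
        // Same lookup works for both, although the positions differ.
        System.out.println(method.indexOf(method.named("parameters"))); // prints 3
        System.out.println(ctor.indexOf(ctor.named("parameters")));     // prints 2
    }
}
```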