Some Directions for Future CongoCC Development

revusky

I've had these thoughts beating around in my head for at least a couple of months now but just haven't buckled down to write about them. I'm finally doing that, in the hope of getting some discussion going.

I guess the main overriding idea I want to put out there is that the CongoCC project should evolve towards becoming less monolithic.

For example, we have the examples directory which contains fully functional parsers for Java, Python, C#, and Lua. These things are really quite complete and robust and potentially quite useful to people out there. However, very few (hardly any) people are getting any benefit from this, and that is surely for a very basic reason:

They don't know that this body of work exists!

I think that is generally the case, but I would add that, even when they do know that it exists -- for example, that CongoCC contains within it a fully functional Java parser that anybody can use -- they don't really know how to leverage that.

Now, for one thing, the Java parser (which is just one specific case), even though it lives in the examples directory, is not really just an "example". It is actually an integral part of the overall CongoCC tool. It is not any sort of toy example either. (Actually, as a pedagogical example for a CongoCC neophyte, the Java grammar is surely way too big and complex.) This really is a complete, robust parser for the Java language! Well, again, the bottom line is that there are surely plenty of people out there who are not in the market for a parser generator, but would really like to have a fully correct parser for the Java language that they could leverage in their own work. However, they don't know that this even exists!

I would have to admit that my thinking is still a bit woolly on this, but I really think we should see our way to packaging the Java parser as a separate project (or sub-project, if you will...) that can be used on its own.

Of course, the same considerations would also apply to the C# and Python parsers. The Lua parser is a bit different because we don't have any goal of generating parsers in Lua itself. Also, the Lua language is quite a bit smaller really than the other languages, so simply presenting the Lua grammar as a relatively small example that people can study makes more sense in that case. (Though there is the same point that it is also a complete useful tool in itself, not just a toy example.) But, regardless, in the case of the Java/C#/Python troika, I think that packaging them as separate projects (or sub-projects) really could be appealing.

But... we would have to take this one step at a time, so I think that packaging the Java grammar/parser, and even announcing it separately, would be the initial step. Another way of conceptualizing this is that the Java parser could emerge as a separate project in the same way that FreeMarker (FreeMarker 3, the version under our control) has. As most people reading this probably realize, CongoCC and FreeMarker have a nifty mutual dependency relationship: the latest FreeMarker is built using the latest CongoCC and the latest CongoCC is built using the latest FreeMarker. And I think that having a similar mutual dependency relationship with the Java parser (reconfigured as a separate project) would be equally nifty. CongoCC uses the Java parser and the Java parser in turn is built using CongoCC. I also tend to think, by the way, that all of these mutual interdependencies tend to create a development environment which is not very conducive to bugs surviving for very long!

Oh, and I would add that the above is a very basic advantage that CongoCC has over its main competitor, ANTLR. Well, it's a nuanced question actually. ANTLR does have this huge repository of grammars for various languages but these grammars do not have the same relationship with the core tool that our Java parser has with the CongoCC core tool. For example, they have several Java grammars in that repository, but I am rather dubious about their commitment to maintaining them. By contrast, our Java parsing capability is a core part of the tool, so we are absolutely committed to the Java parser component being completely correct and robust. Well, it might well have whatever glitches at any point in time, but there is a very strong tendency for them to come to our attention and thus, get fixed. Or, to put it another way, anybody reporting a bug in the Java grammar is, in essence, reporting a bug in the core tool. That is not really the situation with ANTLR's (admittedly vast) repository of contributed grammars.

Now, maybe the main reason that I would really like to define the Java parser (and eventually the C# and Python parsers) as separate projects is that the scope of CongoCC is just way too large. Just maintaining up-to-date grammars for the various languages is already a pretty big commitment. In fact, the grammars have slipped behind the cutting edge a bit, though, for the moment, these parsers can still handle the vast majority of code in the respective languages in the wild. For example, as regards the Java language, I think the only stable language feature that Congo's Java grammar does not currently handle is Record Patterns, which became a stable feature in JDK 21. I doubt that it is widely used at the moment, but that will surely change. There are a bunch of preview features that we're not (yet) supporting, but I don't see too much need to support such new features until they are marked as stable.

But I think there really is a need to attract more contributors. I think the current monolithic structure of the project makes that very difficult, because, for most people it's just too big an intellectual investment (or they perceive it that way, which amounts to the same thing) to get their heads around the overall project. If all someone is really interested in is the Java parser component, then this should provide an entry point into the project that requires far less up-front investment.

Now, in terms of the Java parser/grammar (and eventually the C# and Python parsers) being separate standalone projects, here are a couple of things we really need to nail down:

Fault-tolerant parsing
Polyglot parser generation (we definitely need INJECT to work not just for Java)

The standalone Java parser is fairly useful to a lot of people even without fault-tolerant parsing functionality, but it would surely be much more useful across a wide variety of situations with fault-tolerant parsing working. (And that applies, of course, to the C# and Python parsers too.) Parsers generated in non-Java languages are less generally useful at the moment because we don't have INJECT working for them...

In terms of getting fault-tolerant really squared away, having a separate user base that uses the parsers in fault-tolerant mode would hopefully get us a lot of feedback about the various error recovery cases that a parser is not handling well. (At least assuming that we can have a few noisy end-users...)

Well, I guess I'll close this message here. I could say more, but then the post would get too monolithic. LOL. Oh, it occurs to me also to add that I own the domain parsers.org and I think it would make perfect sense to develop subdomain sites like java.parsers.org and python.parsers.org. But, you know, that's yet another thing. There is so much work to be done even in terms of having a decent web presence. Well, that's just one more front on which I certainly have been pretty deficient, mea culpa, but still, there just is the problem that the project is totally undermanned, so I am thinking that maybe breaking this monolithic project into smaller more bite-sized pieces could help in terms of attracting collaborators.

vsajip

revusky But I think there really is a need to attract more contributors. I think the current monolithic structure of the project makes that very difficult, because, for most people it's just too big an intellectual investment (or they perceive it that way, which amounts to the same thing) to get their heads around the overall project

I think the biggest barrier to attracting contributors (and users) is the lack of documentation. While I have no objection to splitting out the parsers into separate projects (ideally, in a way that doesn't require large-scale changes to the existing project, e.g. a script to copy/transform the relevant bits to a new directory), I strongly doubt that it will, by itself, lead to any big change in the number of contributors and users.

revusky

vsajip

Well, I obviously can't disagree with you that there really ought to be much better docs. However, I would point out that, if one breaks off a given subproject (and I'll take the Java parser/grammar as an example) and documents that, one is effectively documenting the main tool -- or at least significant parts of it. As a concrete example, one could consider a very simple example of leveraging the existing Java grammar, something like:

    PARSER_PACKAGE="my.javaparser";
    NODE_PACKAGE="my.java.parser.ast";
    JAVA_UNICODE_ESCAPE;
    DEFAULT_LEXICAL_STATE=JAVA;
    FAULT_TOLERANT=true;
    
    INCLUDE JAVA

    INJECT ClassDeclaration :
    {
           java.util.List<MethodDeclaration> getMethods() {
               return childrenOfType(MethodDeclaration.class);
           }
    }

All that the above example grammar does is include the Java grammar and inject one convenience method into the ClassDeclaration node. I mean, just to explain the above trivial example, you have to explain:

The whole way that INJECT works, that you can just insert a convenience method into a node.
Various options at the top of the file, like the fact that you can change the packages into which you generate the parser and nodes.
Almost certainly, one would end up providing a trivial example of the use of the Node.Visitor base class to traverse the nodes in an AST.

So, we would inevitably end up documenting these various things, except that it would be more focused on the specific question of somebody re-using the Java grammar. (Or the Python grammar or whatever...)
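To make the INJECT idea a bit more concrete, here is a minimal, self-contained Java sketch of roughly what the injected convenience method amounts to once it lives inside the node class. Everything in it is a hypothetical stand-in written for illustration: the tiny Node, MethodDeclaration, FieldDeclaration and ClassDeclaration classes, and the addChild, sampleClass and methodNames helpers, are not the classes CongoCC actually generates. Only the names getMethods and childrenOfType are taken from the grammar snippet above, and the behavior of childrenOfType here (filtering direct children by type) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class InjectSketch {

    // Hypothetical stand-in for a generated AST node; NOT the real CongoCC Node API.
    static class Node {
        private final List<Node> children = new ArrayList<>();

        void addChild(Node child) {
            children.add(child);
        }

        // Assumed behavior: return the direct children that are instances of the given type.
        <T extends Node> List<T> childrenOfType(Class<T> type) {
            List<T> result = new ArrayList<>();
            for (Node child : children) {
                if (type.isInstance(child)) {
                    result.add(type.cast(child));
                }
            }
            return result;
        }
    }

    // Hypothetical node types, just enough to show the filtering.
    static class MethodDeclaration extends Node {
        final String name;

        MethodDeclaration(String name) {
            this.name = name;
        }
    }

    static class FieldDeclaration extends Node {
    }

    static class ClassDeclaration extends Node {
        // The convenience method from the INJECT block, written out by hand here.
        List<MethodDeclaration> getMethods() {
            return childrenOfType(MethodDeclaration.class);
        }
    }

    // Builds a small tree by hand; a real tree would come from a generated parser.
    static ClassDeclaration sampleClass() {
        ClassDeclaration decl = new ClassDeclaration();
        decl.addChild(new FieldDeclaration());
        decl.addChild(new MethodDeclaration("foo"));
        decl.addChild(new MethodDeclaration("bar"));
        return decl;
    }

    // Collects the names of the methods found by the convenience method.
    static List<String> methodNames(ClassDeclaration decl) {
        List<String> names = new ArrayList<>();
        for (MethodDeclaration method : decl.getMethods()) {
            names.add(method.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(methodNames(sampleClass())); // prints [foo, bar]
    }
}
```

The point of the sketch is only that client code gets to call getMethods() directly on the node, without subclassing or post-processing the generated classes. How the real generated nodes are built and traversed is exactly the sort of thing the focused documentation would need to spell out.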

But, I would say that, in terms of using the Java (or the Python or C#) grammar in one's own project, the main reason that there is not more uptake on that, even beyond the documentation issue, is simply that people don't know about it!

I would also add that, if you look back on the history of all of this, the original JavaCC project became quite popular without having much in the way of serious documentation. As I recall, the way I figured out how to use it (and I had never used a parser generator before) back in 2001 was mostly just by going through the various little examples that came with it and experimenting with them. In fact, what documentation did exist at the time, like I recall something called the Lookahead mini-tutorial, was really pretty rough-and-ready. And actually, the situation has not changed that much! See here. Probably the best resource available was the JavaCC FAQ, which is now part of the JavaCC distribution.

Now, as for the question of whether there are people interested in having a robust Java parser that they can use freely in their own projects, it is quite obvious that there is significant interest in that. For example, there is this project called JavaParser that is just that. The origin of that project, by the way, is that somebody decided to fork off the example Java grammar that was part of the JavaCC distro and work on that separately. And that project has been around since at least 2011 (that is where their GitHub commit history begins) but maybe longer. And that project has some significant activity/usage statistics, compared to us anyway -- 5200 stars, 1100 forks, 145 watchers... Now, technically speaking, JavaParser is not that great a project, mostly because it is built on very shaky underpinnings. The legacy JavaCC that it is built on top of is broken in very key ways, like lookahead not working properly, but also the parsers it generates have zero concept of fault-tolerance/error-recovery. So, anything we put out there to compete with them is bound to be much more generally useful.

And, of course, Python is a very popular language -- by some measures, the MOST popular language nowadays -- and there are surely plenty of people interested in having a good way to manipulate/process Python source code. I don't honestly know what's out there. But again, we ought to be able to put out some python.parsers.org tool (defined as a separate tool, I guess you could say) that could have some uptake.

But people have to know about these things. And, again, I think that just some very straightforward usage examples to get people going could work wonders...

All that said, I am reluctant to even speculate on what would attract people's interest, since I've been so bad at that. I do have this vague notion that we have been aiming over people's heads for the most part, and we really need to present some concrete usage scenarios (and I think the ability to parse Java/Python/C# source code is a real usage scenario) and really just step people through how to use this in a very simple sort of way.

Or, to put it another way, I think that having some very concrete usage examples of that sort could actually attract people more than a complete manual. But, that said, of course I agree that a complete manual would be a good thing to have!