I've had these thoughts kicking around in my head for at least a couple of months now but just haven't buckled down to write them up. I'm finally doing that, in the hope of getting some discussion going.
I guess the main overriding idea I want to put out there is that the CongoCC project should evolve towards becoming less monolithic.
For example, we have the examples directory, which contains fully functional parsers for Java, Python, C#, and Lua. These are quite complete and robust and potentially quite useful to people out there. However, hardly anyone is getting any benefit from this, and that is surely for a very basic reason:
They don't know that this body of work exists!
I think that is generally the case, but I would add that, even when they do know that it exists -- for example, that CongoCC contains within it a fully functional Java parser that anybody can use -- they don't really know how to leverage that.
Now, for one thing, the Java parser (which is just one specific case), even though it lives in the examples directory, is not really just an "example". It is actually an integral part of the overall CongoCC tool. It is not any sort of toy example either. (Actually, as a pedagogical example for a CongoCC neophyte, the Java grammar is surely way too big and complex.) This really is a complete, robust parser for the Java language! The bottom line, again, is that there are surely plenty of people out there who are not in the market for a parser generator but would really like to have a fully correct parser for the Java language that they could leverage in their own work. However, they don't know that this even exists!
I have to admit that my thinking on this is still a bit woolly, but I really think we should see our way to packaging the Java parser as a separate project (or sub-project, if you will...) that can be used on its own.
Of course, the same considerations would also apply to the C# and Python parsers. The Lua parser is a bit different because we don't have any goal of generating parsers in Lua itself. Also, the Lua language is quite a bit smaller really than the other languages, so simply presenting the Lua grammar as a relatively small example that people can study makes more sense in that case. (Though there is the same point that it is also a complete useful tool in itself, not just a toy example.) But, regardless, in the case of the Java/C#/Python troika, I think that packaging them as separate projects (or sub-projects) really could be appealing.
But... we would have to take this one step at a time, so I think the initial step would be packaging the Java grammar/parser and even announcing it separately. Another way of conceptualizing this is that the Java parser could emerge as a separate project in the same way that FreeMarker (FreeMarker 3, the version under our control) has. As most people reading this probably realize, CongoCC and FreeMarker have a nifty mutual dependency relationship: the latest FreeMarker is built using the latest CongoCC and the latest CongoCC is built using the latest FreeMarker. And I think that having a similar mutual dependency relationship with the Java parser (reconfigured as a separate project) would be equally nifty. CongoCC uses the Java parser and the Java parser in turn is built using CongoCC. I also tend to think, by the way, that all of these mutual interdependencies create a development environment in which bugs do not survive for very long!
Oh, and I would add that the above is a very basic advantage that CongoCC has over its main competitor, ANTLR. Well, it's a nuanced question actually. ANTLR does have this huge repository of grammars for various languages but these grammars do not have the same relationship with the core tool that our Java parser has with the CongoCC core tool. For example, they have several Java grammars in that repository, but I am rather dubious about their commitment to maintaining them. By contrast, our Java parsing capability is a core part of the tool, so we are absolutely committed to the Java parser component being completely correct and robust. Well, it might well have whatever glitches at any point in time, but there is a very strong tendency for them to come to our attention and thus, get fixed. Or, to put it another way, anybody reporting a bug in the Java grammar is, in essence, reporting a bug in the core tool. That is not really the situation with ANTLR's (admittedly vast) repository of contributed grammars.
Now, maybe the main reason that I would really like to define the Java parser (and eventually the C# and Python parsers) as separate projects is that the scope of CongoCC is just way too large. Just maintaining up-to-date grammars for the various languages is already a pretty big commitment. In fact, the grammars have slipped behind the cutting edge a bit, though, for the moment, these parsers can still handle the vast majority of code in the wild in their respective languages. For example, as regards the Java language, I think the only stable language feature that Congo's Java grammar does not currently handle is Record Patterns, which has been a stable feature since JDK 21. I doubt that it is widely used at the moment, but that will surely change. There are a bunch of preview features that we're not (yet) supporting, but I don't see much need to support such new features until they are marked as stable.
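For anyone who has not run into Record Patterns yet, here is a minimal sketch of the syntax in question, just to show what the grammar would need to accept. It requires JDK 21 or later. The class, record, and method names (RecordPatternDemo, Point, Line, sumCoordinates, describe) are made up for illustration and have nothing to do with the CongoCC code base.

```java
// Illustrative only: every name in this file is hypothetical.
class RecordPatternDemo {
    record Point(int x, int y) {}
    record Line(Point start, Point end) {}

    // A record pattern in an instanceof test: matches a Point and
    // binds its components to x and y in one step.
    static int sumCoordinates(Object obj) {
        if (obj instanceof Point(int x, int y)) {
            return x + y;
        }
        return 0;
    }

    // Record patterns can be nested and used as switch case labels.
    static String describe(Object obj) {
        return switch (obj) {
            case Line(Point(var x1, var y1), Point(var x2, var y2)) ->
                "line from (" + x1 + "," + y1 + ") to (" + x2 + "," + y2 + ")";
            case Point(int x, int y) -> "point (" + x + "," + y + ")";
            default -> "something else";
        };
    }

    public static void main(String[] args) {
        System.out.println(sumCoordinates(new Point(3, 4)));
        System.out.println(describe(new Line(new Point(0, 0), new Point(1, 2))));
    }
}
```

The nested form in the switch is probably the more interesting case from a grammar point of view, since a pattern can contain further patterns to arbitrary depth.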
But I think there really is a need to attract more contributors. I think the current monolithic structure of the project makes that very difficult, because, for most people it's just too big an intellectual investment (or they perceive it that way, which amounts to the same thing) to get their heads around the overall project. If all someone is really interested in is the Java parser component, then this should provide an entry point into the project that requires far less up-front investment.
Now, in terms of the Java parser/grammar (and eventually the C# and Python parsers) being separate standalone projects, here are a couple of things we really need to nail down:
- Fault-tolerant parsing
- Polyglot parser generation (we definitely need INJECT to work not just for Java)
The standalone Java parser is fairly useful to a lot of people even without fault-tolerant parsing, but it would surely be much more useful across a wide variety of situations with fault-tolerant parsing working. (And that applies, of course, to the C# and Python parsers as well.) Parsers generated in non-Java languages are less generally useful at the moment because we don't have INJECT working for them...
In terms of getting fault-tolerant parsing really squared away, having a separate user base that uses the parsers in fault-tolerant mode would hopefully get us a lot of feedback about the various error recovery cases that a parser is not handling well. (At least assuming that we can have a few noisy end-users...)
Well, I guess I'll close this message here. I could say more, but then the post would get too monolithic. LOL. Oh, it occurs to me also to add that I own the domain parsers.org and I think it would make perfect sense to develop subdomain sites like java.parsers.org and python.parsers.org. But, you know, that's yet another thing. There is so much work to be done even in terms of having a decent web presence. Well, that's just one more front on which I certainly have been pretty deficient, mea culpa, but still, there just is the problem that the project is totally undermanned, so I am thinking that maybe breaking this monolithic project into smaller more bite-sized pieces could help in terms of attracting collaborators.