it is so slow that it is impractical to use in production for large grammars. Is it the price to pay for not having to specify lookaheads by hand?)
Oh, I honestly don't know why ANTLR is so slow. My sense is that it may not be that slow generally speaking, but there must be some constructs for which it generates painfully slow code. Probably they need to buckle down and figure it out; I don't really know. But, as anyone can see, the situation persists year after year, decade after decade. I think the last time I played with it was a couple of years ago, and I noted that memory usage can be quite terrifying as well!
Well, anyway, I should tell you first of all that I did a bit more work on the "code too large" problem today, and it should really be gone now. As things stood, the XXX_init method would still hit "code too large" if a single lexical state had somewhere north of 6000 NFA states. I didn't think that was very common, since even the largest lexical grammars, like those for C# and Java, never reach even 1000 NFA states. Actually, when I wrote my (overly triumphant) post The Dreaded “Code too large” Problem is a Thing of the Past nearly four years ago, I wrote the following:
As a final point, the above example assumes that no individual array is so big as to hit the "Code too large" limitation on its own. And that does seem to be the case, in practice, with JavaCC. Fairly obviously, if any single array was big enough, on its own, to hit this limitation, you would need multiple XXX_populate() methods. So, if your foo[] array had, let's say 20,000 elements, you could generate:
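Something along these lines (a sketch; the names, values, and chunk size here are purely illustrative):

private static final int[] foo = new int[20000];

private static void foo_populate_1() {
    // In real generated code, each slot is a literal assignment; it is
    // those thousands of assignments that blow the 64KB-per-method limit.
    foo[0] = 7;
    foo[1] = 42;
    // ... one literal assignment per slot, up to foo[9999]
}

private static void foo_populate_2() {
    foo[10000] = 13;
    // ... and so on, up to foo[19999]
}

private static void foo_init() {
    // each populate method stays under the bytecode limit on its own
    foo_populate_1();
    foo_populate_2();
}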
And so on... Basically, it puts all the NFA acceptance methods in a big array, and if the array has more than about 6000 elements, you need to initialize it in more than one XXX_init method. But I never bothered to code the logic for that because it didn't seem like that ever really happened... (WRONG.)
But now it should be okay. See https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/NfaCode.java.ftl#L30-L71 for how the problem was addressed. It's really very trivial.
If the NFA_FUNCTIONS array is very big, the initialization is now broken into multiple methods. So if you rebuild with the latest code, the problem should be gone. Or you can just grab a pre-built jar.
As for the other stuff, well, the "choice conflict" thing: the history of it is that the code that implemented it in legacy JavaCC was just horrible. I kept thinking I would clean it up, but it was written in such an opaque manner that I finally decided it was beyond my ability to fix, so I just tore it out. I was probably thinking at the time that I would address it later.
The thing is, though, that if you are working up a grammar in CongoCC, the whole choice conflict thing is of limited use. I mean if you have:
<DO> <FOO> ...
|
<DO> <BAR> ...
Sure, the second choice is unreachable if you only look ahead the default single token. But the usefulness of having the tool tell you that is really pretty marginal. I mean, if you're writing a grammar, you are typically testing the parser incrementally on typical input, and you would see that the above just doesn't work: it never enters the second choice if you don't specify any lookahead or up-to-here. You need:
<DO> <FOO> =>|| ...
|
<DO> <BAR> ...
So... Of course, if you are using a tool like ANTLR, which has a runtime engine that sorts this out for you, and you want to convert an entire, potentially huge grammar to CongoCC, I guess that could be a problem, because suddenly you have all these cases where you're missing the up-to-here.
I'm now thinking about what to do about this...
As for the tracing, to be honest, I was always kind of skeptical about how JavaCC dealt with this. With a very big grammar, it would tend to output so much that... I mean, in general, it's a very crude mechanism, no? Now, of course, with CongoCC you can just put println() (or log) calls in a code action in the productions you're interested in. You can also use the life-cycle hooks that exist for tree-building. For example, if you want to inject some code at the point of starting a new node or closing one in the tree-building process, you can define methods with the magic names OPEN_NODE_HOOK and CLOSE_NODE_HOOK, respectively. So, you could have:
INJECT PARSER_CLASS :
{
    void OPEN_NODE_HOOK(Node n) {
        // do something by default at the moment of opening a new node
    }

    void CLOSE_NODE_HOOK(Node n) {
        // do something by default at the moment of closing a node
    }
}
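For instance, a crude replacement for the old trace output could be built on those hooks. This is just a sketch (the depth counter is my own illustration, not anything built into CongoCC, and String.repeat needs Java 11+):

INJECT PARSER_CLASS :
{
    private int traceDepth;

    void OPEN_NODE_HOOK(Node n) {
        // indent by nesting depth and print the node type on the way in
        System.out.println("  ".repeat(traceDepth++) + "-> " + n.getClass().getSimpleName());
    }

    void CLOSE_NODE_HOOK(Node n) {
        // ... and again on the way out
        System.out.println("  ".repeat(--traceDepth) + "<- " + n.getClass().getSimpleName());
    }
}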
If you really, really want, you have access to the call stack; it's called parsingStack internally. Though maybe I shouldn't encourage you (I mean, not you specifically, but any application programmer) to use things like that... But you can. Actually, it's probably not so bad an idea if you only use it in a read-only manner...
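For example, something like this (a sketch only; I'm assuming here that parsingStack is an iterable collection whose elements print something meaningful, which may not be exactly right):

INJECT PARSER_CLASS :
{
    void dumpParsingStack() {
        // read-only walk over the internal stack of production calls;
        // we never push or pop, just print what is there
        for (Object frame : parsingStack) {
            System.out.println(frame);
        }
    }
}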
But what I mean is that, what with the ability to define life-cycle hooks and so on...
Well, also, you have assertions. You can pepper your grammar code with ASSERT statements to verify things about the state of the parser or tree or whatever. Actually, for example, you could write something like:
ASSERT {!isInProduction("Foobar")} : "We are not supposed to be in a Foobar production here!"
The assertion fails if we are currently in a Foobar production; a sketch of how that might sit inside a production follows below. Well, okay, stuff like this is not particularly well documented, but I mean the range of debugging tricks available in CongoCC compared to legacy JavaCC is...
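Here is a hypothetical production using such an assertion (the production and token names are made up, just to show placement):

Baz :
    <LBRACE>
    ASSERT {!isInProduction("Foobar")} : "We are not supposed to be in a Foobar production here!"
    Expression
    <RBRACE>
;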
Well, anyway, do tell me if the "code too large" issue is fixed. I'm going to think about maybe adding back some choice-conflict checking (really, dead-code checking in practice), maybe not exhaustive, but something that would cover typical situations.