adMartem
I had "assumed" the grammar would work like a PEG grammar (although I had no knowledge of PEG at that time) and process the rule's choices left to right, rather than as a real CFG would. Boy was I misguided.
Yeah, well, the thing is that surely this is just how anybody would intuit the question. You have a series of options, possible branches, and you go through them in order and the first one that matches, you go with that one...
But, you know, generally speaking, all of this parsing theory pretty much has me perplexed, drives me nuts. I suppose you know the old quip, that modern abstract art is some massive conspiracy to convince ordinary people that they're stupid. So, I was wondering the same thing about some of this parsing theory. The context-free grammar (CFG) concept, that all of the branches in a choice have equal priority... As an implementer, how does that even work? It just seems pretty obvious to me that the earlier choices would have priority, no? I mean, the procedural programming that we all cut our teeth on tends to be like:
if (conditionA) {do A}
else if (conditionB) {do B}
....
else {We have a problem, Houston...}
So you go through and you try the choices in order. And once you match a given condition and go into the corresponding block, you don't try to match any of the later ones! If conditionA matches, then you do A, and you don't even check conditionB or any of the later ones. To be saying that you check all the conditions and if more than one is true, you have a... drumroll... ambiguity... The original JavaCC had this really horrendous code in there to check for these "ambiguities", except to my mind, they are not ambiguities. They're just dead code. So if you write:
Expression ("," Expression)* [","]
then using 1-token lookahead logic (which is what you have if you don't specify anything else) then the optional comma at the end will never be hit. You probably noticed that I put back the warnings for these things half a year ago or so, but the warning message does not say that there is any "ambiguity". It just says that the second comma is unreachable, which means you should have look at what you're doing there! But all this talk of "ambiguity"...
So, kudos to Bryan Ford for deciding that the earlier specified branch has priority. I was looking at this again just today and I see (I saw it before, I think) that Ford was experimenting with "memoization" to speed up the parsing. So he'd cache results of certain paths through the input, I guess. Actually, if you go through a CongoCC parser step by a step in a debugger, you do see that it's frequently scanning over the same sequence of tokens again and again. So one could wonder whether memoization could speed things up significantly. I thought about it somewhat, but I have my doubts by how much. Unless you're switching between lexical states during the lookahead (which is not the most typical case) all the tokens are cached, so you're just scanning a number of them and checking the tokenType and I think it's really pretty fast. After all, you type ant test
and it builds the various parsers -- Java, Python, C# -- and in each case, parses through some thousands of files. I honestly doubt that many people have a practical need to parse code any faster than that! I dunno for sure. It's not like people show up and say they want to use this but it's too slow. But not many people show up and say anything at all, so...
But I just don't relate to all that blather about "ambiguity". Of course, natural language is full of ambiguity highly dependent on context. But you'd think that a parsing system like CongoCC just wouldn't have any ambiguity really. If you can build the parser code, it compiles and runs, then its behavior is pretty determinate. Even left recursion. There was a check for that in legacy JavaCC, but I just pulled it out at some point, because, like with the other "ambiguity" checks, the code was just horrible and I figured it would be easier to just write it from scratch at a later point. So I did it redo it and now it warns you about left recursion and exits, but before that, it would run, and you'd get a stack overflow... but that does not strike me as an ambiguous result. It looks pretty determinate to me!
Well, I could rant more about all this. Oh, here is one thing that I resolved (to my satisfaction anyway) some time ago, which is, you know, when you have two regexp specs that will match the input. Well, you know, the longest match wins, of course. But if they match the same length of input, then it is the earlier specified regexp that wins out. (Strangely, nobody says that is an ambiguity!) So, you can have a problem if you specify a literal word after the definition of an identifier so you have:
<ID : (["a"-"z", "A"-"Z"])+>
|
<FOO : "foo" >
Of course, FOO will never be matched because if there is a foo, it will always be matched by the ID regexp, and since that is the earlier specified one... So, actually, the original JavaCC had some (again, totally obfuscated) code to check for these things and warn you. So, I removed that and, as in the other cases, left the user on his own. But then (as in the other cases) I just finally decided that if somebody specified a literal string and also a regexp that matches that literal string, they always will want the literal string to take priority. So I quietly changed things so it works that way. Basically, I just sort them so that the literal strings always occur earlier, or it's as if they did -- even if, in the file, they don't. I just couldn't conceive of when one would specify a literal string as a pattern and not want that to have priority over something more general.
But regardless of that, in a well specified system, how can you really have ambiguity? It surely just means that you haven't specified things precisely enough, no? Oh, and by the way, I just reread this somewhat sarcastic blog article that I wrote nearly 5 years ago, and really, I don't think my opinion has changed particularly. Well, I did put back in the warnings about the dead code, -- well, really just the ones that it is easy to warn about, where we're operating with single-token lookahed and no other conditions (like lookahead predicates expressed in Java code) that could complicate matters.
EOR (End of Rant)