Mucking with ANTLR

revusky

As a consequence of the discussion initiated by ngx about converting a large ANTLR grammar to CongoCC, I started looking at the ANTLR grammar (for SQL) he pointed to and I started thinking about writing (maybe) a converter -- not specifically for the SQL grammar, but for ANTLR4 grammars generally, of course.

Well, I haven't written an ANTLR->Congo converter yet, but I have done the first step, which is to write a CongoCC grammar for ANTLR grammar files. See here. The parser that this generates can parse all the files in the ANTLR grammar repository.

It is actually striking how modest a project this turned out to be. The resulting grammar is a bit over 500 lines. That compares to the CongoCC grammar which is more like 1600 lines, and that is not including the fact that it INCLUDEs the Java grammar which is as big again.

If you count the embedded Java grammar, the CongoCC grammar is something like 3000 lines, but note, of course, that the ANTLR grammar/parser has no knowledge of Java or any of the other target languages. The way that it deals with an embedded code action is simply to handle it lexically, all the characters inside the {...} delimiters. So, basically, it just matches delimiters, which is a rather crude approach. The tree that CongoCC builds from a .ccc grammar file actually contains the sub-trees for all the Java code actions and injections. And, aside from that, the CongoCC grammar files are just syntactically richer, what with lookahead/up-to-here and assertions and so on.
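To make the "just matches delimiters" point concrete, here is a minimal sketch of that purely lexical approach. The function name `skip_action` and the sample rule are made up for illustration; this is not code from either tool.

```python
def skip_action(text, start):
    """Return the index just past the '}' that closes the '{' at text[start].

    Purely lexical: it only counts braces and knows nothing about the
    target language, so a brace inside a string literal or a comment
    would fool it.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")


# Hypothetical rule with a nested block inside the action.
rule = "expr : ID { if (x) { count++; } } ;"
begin = rule.index("{")
end = skip_action(rule, begin)
action_text = rule[begin:end]
```

The action comes back as one opaque string, which is exactly why no sub-tree for the embedded code can be built this way.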

I wrote the CongoCC grammar for ANTLR using this one as a starting point. Well, again, the problem of parsing the .g4 files is solved. Soon, I'll get into the problem of outputting the resulting tree in a .ccc format. I don't know how well I can get it to work, due to the different semantics. Also, unlike ANTLR, CongoCC does not support left recursion. Though, that said, it is a well-known problem. I may figure out how to transform the productions that use left recursion and output them in CongoCC format without the left recursion. (No promises though!)
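For the direct case, the textbook rewrite turns `A : A x | A y | b` into `A : (b) (x | y)*`. Below is a rough sketch of that transformation only; the data layout (a list of symbol lists) and the helper names are assumptions for illustration, and the rendered text is generic EBNF-style notation rather than guaranteed-valid .ccc output.

```python
def remove_direct_left_recursion(name, alternatives):
    """Split the alternatives of `name` into non-recursive bases and the
    tails that followed the left-recursive self-reference."""
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    bases = [alt for alt in alternatives if not alt or alt[0] != name]
    if tails and not bases:
        raise ValueError(f"{name} has no non-recursive alternative")
    if any(not tail for tail in tails):
        raise ValueError(f"{name} : {name} is a useless alternative")
    return bases, tails


def render(name, bases, tails):
    """Render the rewritten production as  name : (bases) (tails)* ;"""
    def choice(alts):
        return " | ".join(" ".join(alt) for alt in alts)

    text = f"{name} : ({choice(bases)})"
    if tails:
        text += f" ({choice(tails)})*"
    return text + " ;"


# expr : expr "+" term | expr "-" term | term ;
expr_alts = [["expr", '"+"', "term"], ["expr", '"-"', "term"], ["term"]]
rewritten = render("expr", *remove_direct_left_recursion("expr", expr_alts))
```

Note that the rewrite changes the shape of the resulting tree (a flat loop instead of left-nested nodes), and indirect left recursion is not handled at all here.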

All that said, even a crude converter tool could probably save a lot of time for people in the position of the aforementioned ngx. And, anyway, I thought to announce this here because maybe somebody is interested. After all, there are a lot of existing ANTLR grammars, so the ability to parse them and build and manipulate the AST using a CongoCC tree traversal sort of API could already be useful.

Of course, generally speaking, nobody will know about something like this unless I tell them about it, so...

adMartem

A horse of a similar color...

Well, I accidentally found myself in ANTLR-world too! You might have seen my sandbox commit adding a "peg" folder. It started out with my looking for a language that had a simple syntax, but was meaningful enough to allow me to experiment with fault-tolerant behavior and be able to try it with a variety of test cases that weren't all done by me. I decided that Ford's PEG grammar would be a good candidate since it was extremely small, yet provably powerful, and had a syntax where it intuitively seemed that fault tolerance could be leveraged.
And so I wrote a PEG parser in CongoCC. I quickly realized that the PEG AST could be easily visited to produce a CongoCC equivalent parser by using up-to-here on every choice (not great performance potential, but so what). Of course when I did this, my test case was to generate a PEG parser in CongoCC from Ford's original PEG grammar, so the booting had begun. Upon achieving that goal, I had learned enough about PEG to realize I needed a more powerful version of it to make life easier...
So, I added up-to-here (which I named entailment, since I was in math-world by that time) to allow limiting lookahead where desirable and more normal notation, since I was making mistakes remembering to type the now-archaic notation that Ford used. Then I went looking for test cases. Guess what I found. Essentially everyone that worked from the original Ford paper invented modifications and extensions to his grammar (including Ford), so there didn't seem to be any that were really compatible with the original. But I did find that ANTLR4 was darn close.

Now, as you know all too well, I'm sure, ANTLR might look like PEG, but it is not at all PEG semantically. In particular the predictive(?) parser it generates is more like a Packrat parser driven by an NFA. I remembered all the problems I had with ANTLR years ago, and finally understood why. I had "assumed" the grammar would work like a PEG grammar (although I had no knowledge of PEG at that time) and process the rule's choices left to right, rather than as a real CFG would. Boy was I misguided. But, for my current purpose as a test case source, ANTLR was a cornucopia of possibilities. Hence I chose CPP14.g4 as an experiment.

As it turned out, I didn't have to do much to it, and it was suddenly a grammatically correct (extended) PEG grammar of 2K+ lines. Unfortunately, the CongoCC it generates doesn't parse even the smallest C++ source correctly (I assume because the order of one or more choices in PEG does not result in the longest match, or because I messed up one of the 20 or so left recursions I had to refactor). I added the fragment keyword to what I now call PEG-alt, along with some minor scan-and-pitch things like ; so my next choice (the 200 line calculator grammar) was very quick to modify (remove the -> skips and comment the initial grammar declaration).

Which is where I am now with that little detour.

revusky

adMartem

I had "assumed" the grammar would work like a PEG grammar (although I had no knowledge of PEG at that time) and process the rule's choices left to right, rather than as a real CFG would. Boy was I misguided.

Yeah, well, the thing is that surely this is just how anybody would intuit the question. You have a series of options, possible branches, and you go through them in order and the first one that matches, you go with that one...

But, you know, generally speaking, all of this parsing theory pretty much has me perplexed, drives me nuts. I suppose you know the old quip, that modern abstract art is some massive conspiracy to convince ordinary people that they're stupid. So, I was wondering the same thing about some of this parsing theory. The context-free grammar (CFG) concept, that all of the branches in a choice have equal priority... As an implementer, how does that even work? It just seems pretty obvious to me that the earlier choices would have priority, no? I mean, the procedural programming that we all cut our teeth on tends to be like:

              if (conditionA) {do A}
              else if (conditionB) {do B}
              ....
              else {We have a problem, Houston...}

So you go through and you try the choices in order. And once you match a given condition and go into the corresponding block, you don't try to match any of the later ones! If conditionA matches, then you do A, and you don't even check conditionB or any of the later ones. To be saying that you check all the conditions and if more than one is true, you have a... drumroll... ambiguity... The original JavaCC had this really horrendous code in there to check for these "ambiguities", except to my mind, they are not ambiguities. They're just dead code. So if you write:

          Expression ("," Expression)* [","]

then using 1-token lookahead logic (which is what you have if you don't specify anything else) the optional comma at the end will never be hit. You probably noticed that I put back the warnings for these things half a year ago or so, but the warning message does not say that there is any "ambiguity". It just says that the second comma is unreachable, which means you should have a look at what you're doing there! But all this talk of "ambiguity"...
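Here is a toy hand-written recognizer for that expansion, just to show why the trailing `[","]` is dead code under one-token lookahead. It is only a sketch: `Expression` is reduced to the single token `"e"`, and the function names are invented for the example.

```python
def parse_list(tokens):
    """Recognize  Expression ("," Expression)* [","]  looking at only
    one token when deciding whether to enter the loop or the option."""
    pos = 0

    def expect(kind):
        nonlocal pos
        found = tokens[pos] if pos < len(tokens) else "EOF"
        if found != kind:
            raise SyntaxError(
                f"expected {kind!r} at token {pos}, found {found!r}")
        pos += 1

    def next_is(kind):
        return pos < len(tokens) and tokens[pos] == kind

    expect("e")
    # The loop is entered whenever the next token is ",": that single
    # token is all the decision looks at.
    while next_is(","):
        expect(",")
        expect("e")
    # On leaving the loop the next token cannot be ",", so this branch
    # can never be taken: it is unreachable, not "ambiguous".
    if next_is(","):
        expect(",")
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return True
```

With this, `parse_list(["e", ",", "e"])` succeeds, while `["e", ",", "e", ","]` fails inside the loop, because the final comma commits the parser to another `Expression`.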

So, kudos to Bryan Ford for deciding that the earlier specified branch has priority. I was looking at this again just today and I see (I saw it before, I think) that Ford was experimenting with "memoization" to speed up the parsing. So he'd cache results of certain paths through the input, I guess. Actually, if you go through a CongoCC parser step by step in a debugger, you do see that it's frequently scanning over the same sequence of tokens again and again. So one could wonder whether memoization could speed things up significantly. I thought about it somewhat, but I have my doubts about how much. Unless you're switching between lexical states during the lookahead (which is not the most typical case) all the tokens are cached, so you're just scanning a number of them and checking the tokenType and I think it's really pretty fast. After all, you type ant test and it builds the various parsers -- Java, Python, C# -- and in each case, parses through some thousands of files. I honestly doubt that many people have a practical need to parse code any faster than that! I dunno for sure. It's not like people show up and say they want to use this but it's too slow. But not many people show up and say anything at all, so...
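For what the memoization idea amounts to, here is a tiny sketch: a backtracking ordered choice `S : A "x" | A "y"` re-runs `A` from the same position, and caching the result by position avoids the rescan. Everything here (the token tuple, the rule names, the call counters) is made up for illustration and says nothing about how much it would help a real CongoCC parser.

```python
from functools import lru_cache

TOKENS = ("a", "a", "a", "y")
calls = {"plain": 0, "memo": 0}


def match_a_run(pos):
    """A : ("a")+  Return the end position, or None if nothing matched."""
    end = pos
    while end < len(TOKENS) and TOKENS[end] == "a":
        end += 1
    return end if end > pos else None


def plain_a(pos):
    calls["plain"] += 1
    return match_a_run(pos)


@lru_cache(maxsize=None)
def memo_a(pos):
    # The cache key is the start position, so a second attempt from the
    # same position never re-scans the tokens.
    calls["memo"] += 1
    return match_a_run(pos)


def parse(a_rule):
    """S : A "x" | A "y"  with ordered choice and backtracking."""
    for last in ("x", "y"):
        end = a_rule(0)
        if end is not None and end < len(TOKENS) and TOKENS[end] == last:
            return end + 1
    return None


plain_result = parse(plain_a)
memo_result = parse(memo_a)
```

Both runs accept the input; the plain version scans the run of `"a"` tokens twice, the memoized one once. Whether that saving matters when the tokens are already cached is exactly the open question.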

But I just don't relate to all that blather about "ambiguity". Of course, natural language is full of ambiguity highly dependent on context. But you'd think that a parsing system like CongoCC just wouldn't have any ambiguity really. If you can build the parser code, it compiles and runs, then its behavior is pretty determinate. Even left recursion. There was a check for that in legacy JavaCC, but I just pulled it out at some point, because, like with the other "ambiguity" checks, the code was just horrible and I figured it would be easier to just write it from scratch at a later point. So I did redo it and now it warns you about left recursion and exits, but before that, it would run, and you'd get a stack overflow... but that does not strike me as an ambiguous result. It looks pretty determinate to me!

Well, I could rant more about all this. Oh, here is one thing that I resolved (to my satisfaction anyway) some time ago, which is, you know, when you have two regexp specs that will match the input. Well, you know, the longest match wins, of course. But if they match the same length of input, then it is the earlier specified regexp that wins out. (Strangely, nobody says that is an ambiguity!) So, you can have a problem if you specify a literal word after the definition of an identifier so you have:

                <ID : (["a"-"z", "A"-"Z"])+>
                |
                <FOO : "foo" >

Of course, FOO will never be matched because if there is a foo, it will always be matched by the ID regexp, and since that is the earlier specified one... So, actually, the original JavaCC had some (again, totally obfuscated) code to check for these things and warn you. So, I removed that and, as in the other cases, left the user on his own. But then (as in the other cases) I just finally decided that if somebody specified a literal string and also a regexp that matches that literal string, they always will want the literal string to take priority. So I quietly changed things so it works that way. Basically, I just sort them so that the literal strings always occur earlier, or it's as if they did -- even if, in the file, they don't. I just couldn't conceive of when one would specify a literal string as a pattern and not want that to have priority over something more general.
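The rule described here (longest match wins, ties go to the earlier spec, and literals are treated as if they came first) can be sketched in a few lines. This is only an illustration using Python's `re` module; `make_lexer` and the spec tuples are hypothetical names, not how CongoCC implements it.

```python
import re


def make_lexer(specs):
    """specs: list of (name, pattern, is_literal) in file order.

    Literal strings are moved ahead of general patterns with a stable
    sort, so among equally long matches a literal wins even when it was
    declared later in the file.
    """
    ordered = sorted(specs, key=lambda spec: not spec[2])
    compiled = [
        (name, re.compile(re.escape(pattern) if literal else pattern))
        for name, pattern, literal in ordered
    ]

    def next_token(text, pos=0):
        best = None
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best, so
            # ties are resolved in favor of the earlier entry.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        return best

    return next_token


# Same order as in the example above: ID first, FOO second.
lexer = make_lexer([("ID", r"[a-zA-Z]+", False), ("FOO", "foo", True)])
```

So `lexer("foo")` yields `FOO` despite the declaration order, while `lexer("food")` still yields `ID` because the longer match takes precedence.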

But regardless of that, in a well specified system, how can you really have ambiguity? It surely just means that you haven't specified things precisely enough, no? Oh, and by the way, I just reread this somewhat sarcastic blog article that I wrote nearly 5 years ago, and really, I don't think my opinion has changed particularly. Well, I did put back in the warnings about the dead code -- well, really just the ones that it is easy to warn about, where we're operating with single-token lookahead and no other conditions (like lookahead predicates expressed in Java code) that could complicate matters.

EOR (End of Rant)

adMartem

revusky
My sense is that "ambiguity" regarding parsers is pretty widely misused. As I recall, the use of the term by the linguists who invented all this formalization referred to its generative meaning. I.e., a grammar is ambiguous iff it can generate the same resulting sentence via two or more distinct sequences of rules (more or less, my memory is fuzzy and I'm too lazy to look it up). Now, in computer-land, the word is used in the context of accepting the language and, to me, it seems like the term "redundant" is more meaningful in most cases when dealing with parsing that is not context-free and is top-down with ordered choices (i.e., a PEG-class parser). That is, if there is a way to recognize a distinct element of the language that is not reachable, it is redundant and possibly an error if there is a dependency outside of the grammar on its actions or position in the AST. So I think a warning whenever this case can be determined is a righteous choice.

adMartem

I remember well when you added the warnings. It uncovered about a dozen cases where my grammar needed to be cleaned up a bit, and revealed one case that, to this day, I can't figure out. I have a sticky note to suss it out someday. So your inclination to add that was worthwhile, at least for my peace of mind.

Yes, Ford also published a famous paper on memoized parsing, I think even before the seminal PEG paper. I think he should get kudos for bothering to formalize the obvious, to paraphrase a bit. I just dug it up because it was one of the few parser-related papers that I was able to fathom with only a few bathroom visits. I don't know if you've had a glance at the peg example in the sandbox, but if so, or you later do, you might notice that, after implementing the "pure" PEG translator to CongoCC, I spliced a few selected CongoCC capabilities into the PEG-alt language that I thought added enough value to make it a usable grammar language. The first was the up-to-here annotation. Since for PEG it was difficult (and probably misguided) to describe it in CongoCC terms, I was forced to think about it in the context of the PEG expression of the grammar. That was when I had the revelation that it is actually an indication of where in a sequence the choice of that sequence may be considered made. In other words, at what point does the choice entail the preceding matched elements. So I named it entailment and made the annotation >>>. I then went looking for a cool Unicode character that would be good as an alternative (since every academic language needs difficult-to-type symbols) and found the nested greater than ⫸. It actually adds considerable practical power to PEG, in that it allows the grammar writer to replace, for any choice sequence, the PEG requirement to accept the entire sequence with only acceptance up to the annotation, at which point the choice is assured. Of course the implementation is trivial, only requiring that the generated CongoCC up-to-here on every sequence be suppressed and only the one at the point of the entailment retained.

The other significant improvement was to allow a < token-name > as an identifier referring to a lexer token. During the booting process of getting a CongoCC grammar that would parse a true PEG grammar that could itself parse and generate a PEG grammar in CongoCC, the biggest problem was the so-called "feature" of PEG that it requires no lexer. That was unbelievably difficult for me to conquer, and resulted in the contrivance of using only a single fixed parser token in the generated grammar (<ANY>). The others needed are all generated as single-state token classes that implement the "regular expression" syntax of the PEG grammar.

But I ramble, so I'll stop now.

revusky

adMartem

Well, I think maybe this stems from some kind of confusion as to what a grammar file is. Is it something like a legal document (or specification) or is it source code that can be compiled to something executable and run? I guess with CongoCC, any grammar is primarily the latter, but it's also true that something like Java.ccc is sufficiently human-readable that you could think of it as the former. I can honestly say that I rarely consult the JLS nowadays. If I want to know if something is legal in Java, I consult the Java grammar that I myself worked up! (Admittedly, I did that by painstakingly reading and re-reading the JLS...) But, of course, the difference is that the Java.ccc actually builds a runnable parser. You can't build a runnable parser from the JLS. Maybe there is some AI out there that can do that, but I don't know about it. As for the CSharp language, that is a total nightmare because whatever spec one finds does not seem to be very rigorous or complete. So the main way I gained confidence that CSharp.ccc is correct was by just converging iteratively on being able to parse a suite of source files in the wild that would presumably have about any valid construct in them. Some tens of thousands of files. The Roslyn source code from the dotnet people themselves is, I think, about the best single codebase to run it over. That team seems pretty determined to actually use the most recent language features as soon as they are out. Of course, that is also how I became confident that the Java grammar was correct, that it can parse the code in src.zip, but also I consulted the actual specification, which I don't even bother to do for the Microsoft stuff, for the most part. On any niggling corner case, any docs they've written just seem to be useless, and as a practical matter, the language really is defined by its implementation!

But anyway, if you think of the grammar as a legal document, then something like:

         "foo" "bar"
         |
         "foo" "baz"

is saying, at face value, that either of the two choices is permissible. Letter of the law, you could think. But the fact remains that only the first of the two will ever be matched unless you give it some sort of instruction that amounts to scanning ahead two tokens to disambiguate. But this sort of thing does cause me to wonder sometimes. After all, the above case is so closed-ended, so cut-and-dry that it is feasible to just quietly put in the 2-token lookahead and have it work with no fuss. But, at least so far, I've declined to do that. I've just continued with the hard-nosed stance that the user must understand that we only scan ahead a single token (unless specified otherwise) and, despite whatever misconceptions, the above will only match the first choice. The second choice is unreachable, but since it is such a clear, closed-ended case, okay, we do warn about it.
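A small simulation of that decision, as a sketch only: the parser picks the first choice whose first `lookahead` tokens match and then commits to it. The `choose` function and its `lookahead` parameter are invented for the example; the second call roughly corresponds to what a SCAN 2 would buy you.

```python
CHOICES = [["foo", "bar"], ["foo", "baz"]]


def choose(tokens, lookahead):
    """Pick the first choice whose first `lookahead` tokens match the
    input, then commit: there is no backtracking into a later choice."""
    for index, choice in enumerate(CHOICES):
        if tokens[:lookahead] == choice[:lookahead]:
            if tokens != choice:
                raise SyntaxError(
                    f"committed to choice {index}, but input was {tokens}")
            return index
    raise SyntaxError("no choice matches")
```

With one token of lookahead, `choose(["foo", "baz"], 1)` commits to the first choice and fails; with two, `choose(["foo", "baz"], 2)` reaches the second choice.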

But, you see, the way I end up looking at it is that, while it is true that there are very closed-ended cases like the above one where we could just quietly put in the obvious solution (scan 2 tokens ahead), in the general case, we can't really do some analysis that reliably identifies when a later choice is reachable or unreachable. At the very least, it is a very hard problem in general, one that we're not really attempting to solve.

I was thinking today that, in some ways, this matches the question of driving a manual vs automatic transmission car. If you are driving a stick shift, obviously it is not going to shift gears for you. If you are going 60 mph and are still in first gear, well, you probably should have shifted up at some point, but you have to do it yourself, it's not going to do it for you. And, granted, most of the market has moved to automatic (though I looked it up and noticed that 60-65% of the cars on the road in Spain are still stick shift, that's courtesy of DeepSeek chatbot.) But, regardless, where I think the analogy also remains valid is that, really, once you have a reasonable conceptual model of what is going on, it's not really all that onerous. I honestly don't think so. Like, once you get used to driving a stick shift, for the most part, the shifting becomes reflexive. You barely think about it because it just becomes part of your muscle memory when driving. And, I think similarly, sticking in an up-to-here or SCAN 2 in a case like the above becomes pretty reflexive. It's not really a big problem.

But again, as for this "context-free grammar" concept and the notion that it is some big hairy problem if more than one choice matches the input... well, I guess I'm repeating myself, but it really looks like something of an ersatz problem. I think that any sane implementation would go through the choices one by one and just opt for the first one that matches and disregard the later ones. And Bryan Ford pointing this out as some great theoretical breakthrough? It really is like the character in the Proust novel who "discovers" that he has been speaking in prose all his life!

adMartem

Oh my god, Proust enters the stage! Yes, I think part of the "problem" (or the eponymous, "ambiguity"?) is that all this sprang out of, and is still used by, people asking, "what words can I string together to express a thought?", or action, or description, etc. and also, "what is the meaning of this string of words?" In our (computer-language-land) world, it is mostly the latter. BTW, I've noticed how many stick-shift cars there are in Spain. I hadn't thought about it before, but it was certainly more than here (where I had to special-order a car from the factory I wanted with stick in 2006) and I feel like it is more than England or Germany.
I believe denoting something in a grammar as ambiguous really refers to the notion that by looking at the sentence (input for us, output for linguists) you can't determine the rules that were applied to parse (or create) it. For CongoCC and PEG that can't happen, but for something like Ss : "s" Ss Ss | {} we just can't parse it (not that anyone using CongoCC would care). Right now I've been playing around with actually detecting things like that, indirect left recursion, and, in general, null path recursion. It is surprisingly brain-twisting. The good news is that, so far, I haven't found any real issues with any of the CongoCC test grammars in those areas. But I also haven't been able to successfully detect all the statically detectable ones I know of at the same time. But I am working on it.
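To see the ambiguity of `Ss : "s" Ss Ss | {}` in the generative sense, one can simply count the distinct parse trees for a string of n `"s"` tokens. This is a quick sketch; the function name `derivations` is made up.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def derivations(n):
    """Number of distinct parse trees of  Ss : "s" Ss Ss | {}  for an
    input consisting of n "s" tokens."""
    if n == 0:
        return 1  # only the empty alternative applies
    # The first alternative consumes one "s"; the remaining n - 1 tokens
    # can be split between the two nested Ss in every possible way.
    return sum(derivations(k) * derivations(n - 1 - k) for k in range(n))


tree_counts = [derivations(n) for n in range(5)]
```

The counts are 1, 1, 2, 5, 14 (the Catalan numbers), so from two tokens on, looking at the sentence no longer tells you which rules were applied.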

revusky

adMartem
I got kind of interested in the whole question of stick shift vs. automatic a few days ago. Apparently -- at least according to the Deepseek AI chatbot -- 2010 is kind of the year of sobrepaso. After about that point in time, automatic transmission cars became more fuel-efficient than manual transmission. I infer that the algorithm for when to shift got sufficiently smart that it does it better than most human drivers. Of course, it might also be true that the stick shift is cheaper to maintain/repair, so there is that. (At least as long as there are mechanics who know how to work on them!)

Up until about 2010, the stick shift was more fuel efficient and that was always what I had believed, but then in the last few days, I started wondering and thought to ask an AI chatbot or two, and that is the answer I get. I also asked DeepSeek how many U.S. drivers even know how to drive a stick shift. It told me: "As of recent estimates, about 80-90% of drivers in the USA do not know how to drive a manual transmission car." ChatGPT said the same, but also said there were other estimates that were lower, like only about half don't know how. I would guess that currently it is way more than half of drivers in North America that have never driven a stick.

Yesterday I was walking down a street lined with parked cars near where I live and peeping into the windows and it wasn't until the 10th car that one was automatic. So I thought, so much for DeepSeek telling me that 40% of the cars in Spain were automatic. But then I looked at some more and I looked at 37 finally, and 9 of them were automatic. So maybe the first 9 in a row being stick shift was not representative.

But, you know, the thing about all of that is that the difference in fuel efficiency between one and the other is not so great. It used to be maybe 3-5% in favor of the stick shift but has shifted the other way. I think it's safe to say that if one was 30 times as fuel-efficient as the other... I mean, imagine if, back in the day, a manual got 30 mpg and an automatic got 1 mpg. That would just be... The automatic transmission cars became dominant because most people found it more comfortable to drive them and the fuel efficiency was only a little worse.

I never really understood anything about ANTLR until recently. The parsing is based on a completely different paradigm, something more like the way the lexical side works. It generates a state machine. That's what the term ATN that you see all over the place there means. ATN stands for "Augmented Transition Network". It's a state machine that actually does effectively scan the choices in parallel.

What is curious is that (a) the state machine approach (DFA or NFA or whatever) really does work very well for tokenization. And (b) it is not what any typical human being would come up with if asked to write a program to break input into tokens! But the state machine approach is apparently more efficient than what a human would write. And it also is kind of beautiful once you understand it... Regular expressions are nifty.

But the state machine approach is also not at all how most humans think intuitively about the parsing problem. IMHO, up-to-here corresponds much more to how most people actually think about the problem. You scan forward to some point and there is a clear demarcation and after that point we can commit to whatever branch. So you have something like:

   ClassDeclaration:
          Modifiers "class" =>|| 
          etc... ;

You scan past all the modifiers and when the next token is "class" then you have a ClassDeclaration. And so on. But that is precisely how a human being would recognize that this is a class declaration.
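Written out by hand, the scan that the up-to-here marker stands for looks roughly like this. It is a sketch, not generated CongoCC code; the modifier set is abbreviated and the function name is invented.

```python
# Abbreviated, illustrative set of modifier keywords.
MODIFIERS = {"public", "private", "protected", "static", "final", "abstract"}


def looks_like_class_declaration(tokens):
    """Scan past any modifiers, then commit if the next token is "class".

    Nothing after that point is examined: once "class" has been seen,
    the choice of this production is considered made.
    """
    pos = 0
    while pos < len(tokens) and tokens[pos] in MODIFIERS:
        pos += 1
    return pos < len(tokens) and tokens[pos] == "class"
```

For example, `["public", "final", "class", "Foo"]` is recognized, while `["public", "int", "x"]` is not, and in both cases the decision is made at the first non-modifier token.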

Anyway, the ANTLR community in their grammars-v4 repository has 4 different Java grammars. Only 1 of them seems usable. The other ones are 50 to 100 times slower than our CongoCC Java grammar. The one that is apparently "optimized" here has some rather disturbing aspects. For one thing, it makes no attempt to enforce the rules about modifiers. For example, look here. They just say it's classOrInterfaceModifier* so presumably you could write public private or protected protected and the grammar writer has no means of precluding that. I guess if you asked them how to deal with this, they would tell you to walk the tree afterwards and check for these things. I guess... And it's true you can do that, but I did set myself the task of writing a completely correct Java grammar.
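The "walk the tree afterwards" workaround would boil down to something like the following check, run over the modifier list of each declaration after parsing. This is a guess at what such a pass might look like, not anything from the grammars-v4 repository, and it covers only two of the rules (no repeats, at most one access modifier).

```python
ACCESS = {"public", "protected", "private"}


def modifier_errors(modifiers):
    """Return a list of problems found in one declaration's modifiers."""
    errors = []
    seen = set()
    for modifier in modifiers:
        if modifier in seen:
            errors.append(f"repeated modifier: {modifier}")
        seen.add(modifier)
    access = sorted(seen & ACCESS)
    if len(access) > 1:
        errors.append("conflicting access modifiers: " + ", ".join(access))
    return errors
```

It works, but the rule now lives in a separate pass instead of in the grammar, which is the complaint being made here.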

Oh, here is another thing that I just find kind of gobsmacking. Here is their definition of an identifier:

       IDENTIFIER: Letter LetterOrDigit*;

and here, a few lines below is how Letter and LetterOrDigit are defined.

fragment LetterOrDigit: Letter | [0-9];

fragment Letter:
    [a-zA-Z$_]                        // these are the "java letters" below 0x7F
    | ~[\u0000-\u007F\uD800-\uDBFF]   // covers all characters above 0x7F which are not a surrogate
    | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to 
                                                             //U+10FFFF
;

Isn't that preposterous? They're saying that all the 7-bit ascii letters, a-z and A-Z are letters and then after 0x7F, EVERYTHING is a letter! Except for high-low surrogate pairs, which they do apparently know about!

Well, one could understand that, at an early stage of development, one just ignores the problem, figuring you'll get back to it later and get it right. After all, this is something quite easy to get right. I never spent that much time on it. I wrote a little program that generates the JavaCC/CongoCC definition of what is a letter in an identifier. That is here. So, meanwhile... the long dash character – can be in an identifier. The upside down question mark used in Spanish, '¿', that's also part of a Java identifier?
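A quick way to see the difference is to compare the quoted Letter fragment with a check based on Unicode general categories. The second function is only an approximation of Java's `Character.isJavaIdentifierStart` (letters, letter numbers, currency symbols, connector punctuation), written with Python's `unicodedata`; both function names are made up for this sketch.

```python
import unicodedata


def antlr_fragment_letter(ch):
    """Approximates the Letter fragment quoted above for one character."""
    if ch in "$_" or "a" <= ch <= "z" or "A" <= ch <= "Z":
        return True
    code_point = ord(ch)
    # Second alternative: anything above 0x7F that is not a high surrogate.
    return code_point > 0x7F and not (0xD800 <= code_point <= 0xDBFF)


# General categories: all letters, letter numbers, currency, connectors.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Sc", "Pc"}


def approx_java_identifier_start(ch):
    """Category-based approximation of Character.isJavaIdentifierStart."""
    return unicodedata.category(ch) in IDENTIFIER_START


# en dash, inverted question mark, e with acute accent
samples = ["\u2013", "\u00bf", "\u00e9"]
fragment_says = [antlr_fragment_letter(ch) for ch in samples]
category_says = [approx_java_identifier_start(ch) for ch in samples]
```

The fragment accepts all three samples; the category-based check accepts only the accented letter.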

I dunno. I just look at this and I find it infuriating. Really, let's face it. These people are clowns. How does one take these people seriously?

adMartem

I'm down to only one known recursion hazard undetected. I think I'll add breadcrumbs since it can be hard to find the actual path.
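For reference, the standard way to find left recursion (including the indirect kind that goes through nullable rules) and report the path is sketched below. This is a generic illustration, not the detection code being worked on; the grammar representation (a dict from rule name to a list of alternatives) and the function names are assumptions.

```python
def nullable_rules(grammar):
    """Fixed-point computation of the rules that can match nothing."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            if name not in nullable and any(
                    all(sym in nullable for sym in alt)
                    for alt in alternatives):
                nullable.add(name)
                changed = True
    return nullable


def left_recursion_path(grammar):
    """Return a cycle such as ["A", "B", "A"] (the breadcrumbs), or None."""
    nullable = nullable_rules(grammar)
    edges = {}
    for name, alternatives in grammar.items():
        reach = edges.setdefault(name, [])
        for alt in alternatives:
            for sym in alt:
                if sym in grammar and sym not in reach:
                    reach.append(sym)
                if sym not in nullable:
                    break  # nothing after this symbol can be leftmost

    def visit(name, trail):
        if name in trail:
            return trail[trail.index(name):] + [name]
        for successor in edges[name]:
            found = visit(successor, trail + [name])
            if found:
                return found
        return None

    for start in grammar:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# A reaches itself through B, but only because C can match nothing.
hazard = {
    "A": [["B", "x"]],
    "B": [["C", "A", "y"], ["z"]],
    "C": [[], ["c"]],
}
safe = {"E": [["T", "+", "E"], ["T"]], "T": [["n"]]}
```

The plain depth-first search is fine for a sketch but can get slow on large grammars; a real implementation would probably want to mark finished rules.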

adMartem

revusky
In my PEG+ mucking I attempted to get the CPP14.g4 grammar example to actually parse C++ (after generating what I thought was an equivalent CongoCC grammar using my PEG+ Pongo conversion). Before I could try that, I of course had to convert the left recursion to iteration in about 20 places and then, because I found rules like this:

enumkey
:
    Enum
    | Enum Class
    | Enum Struct
;

I thought that ANTLR must evaluate all the alternatives and pick the longest match, so I rewrote ones like that to put the choices in PEG-like deterministic order. I could never get the grammar to match a real C++ program, which I attributed to some mistake I must have made in the rewriting. Now that I look at the ANTLR docs, I see that it says that the first matching prefix-viable alternative will always be parsed! So by that, I would say if the CPP14.g4 grammar actually parses C++, it is by accident.
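Under PEG-style ordered choice, the enumkey rule as written can never take its second or third alternative, which is what the rewriting was meant to fix. Here is a sketch of that effect and of the simplest reordering (longest alternative first). The helper names are hypothetical, and sorting by length is a blunt instrument that may also reorder unrelated alternatives.

```python
def first_match(alternatives, tokens):
    """Ordered choice: return the first alternative that is a prefix of
    the token list, or None."""
    for alt in alternatives:
        if tokens[:len(alt)] == alt:
            return alt
    return None


def longest_first(alternatives):
    """Stable sort by descending length, so that no alternative is
    preceded by one of its own proper prefixes."""
    return sorted(alternatives, key=len, reverse=True)


enumkey = [["Enum"], ["Enum", "Class"], ["Enum", "Struct"]]
source = ["Enum", "Class", "Color"]
as_written = first_match(enumkey, source)
reordered = first_match(longest_first(enumkey), source)
```

As written, the match stops after `Enum`; after reordering it takes `Enum Class`.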

I was misled, I think, by the academic papers describing the behavior of the ALL(*) prediction which, I recall, led me to believe the parser used a longest match disambiguation behavior. My confusion was probably because the papers were describing how the predicted match was being determined and not how the parser would actually choose to parse the input as it encountered it (maybe???). Anyway, I'll just color myself confused and press on regardless.

adMartem

Why would you do that! This is a problem with a lot of (maybe most) "grammars" I've run into, in that they parse a language that "resembles" the real language, but would never be usable to compile the language. CICS, for example, has lots of parsers around on github, including ANTLR ones, but they don't actually implement the language IBM supports. This was why I ended up making the cardinality support for CongoCC initially, but now that I have it I've found lots of places in COBOL where I had either done half-assed semantic rules to accomplish it or missed cases entirely that can be easily accommodated with cardinality. Now that I'm looking at left-recursion in detail, I realize that cardinality potentially plays a part in that, or at least functionally intersects, given the duality of iteration and recursion. I had thought when I did the current cardinality assertions that a possibly desirable future extension was allowing the controlling iteration to extend beyond the current production to the nearest enclosing iteration in the parsing stack. Now I realize it is "congruent" to limiting recursion in an iteration-free equivalent of the grammar. Not sure what that is worth, but I found it interesting.

BTW, have you noticed this?

adMartem

adMartem CongoCC is so much more transparent than ANTLR (is my conclusion).

revusky

adMartem CongoCC is so much more transparent than ANTLR (is my conclusion).

Well, actually, I think the above very much understates the problem.

Now, admittedly, a lot of my impression is based on studying the Java grammar in what is basically their "contrib" repository. Actually, there are 4 different Java grammars -- Java8, Java9, Java20, and just plain Java. I was playing around with them and my first conclusion is that there is only one of the 4 that is useable, which is the one with no number on it. This is the one that is supposedly "optimized".

And it is quite true that the other three are just horrifically slow. A basic benchmark I use typically is to run it over the JDK sources, src.zip. And it takes 30 or 40 minutes or something, depending which grammar you try it with. The "optimized" grammar does it in a bit over a minute, which is still nearly twice as slow as the CongoCC Java grammar.

I mentioned earlier that that grammar did not incorporate ANY knowledge of what a permissible Java identifier is. I thought that was kind of shocking, but I later looked at the thing more closely and realized that it also does not incorporate other very fundamental knowledge about the rules of the language. For example, that grammar seems to believe that:

x;

is a valid statement in Java. Well, of course, it's not. x(); is a valid statement. x=7; is a valid statement. new X(); is a valid statement. Now, to me, that kind of thing is not so interesting. The problem that I am grappling with is generating human-readable messages. So, if you give the Congo Java grammar the statement x; as of now it generates:

Assertion at: Java.ccc:1119:7 failed. Expression at Foobar.java:11:10 is not a valid statement.
Expecting a method call, an assignment or an allocation expression (new...)
    at Foobar.java:11:11 in StatementExpression(Java.ccc:1119:7,JavaParser.java:8155)
    at Foobar.java:11:10 in ExpressionStatement(Java.ccc:1131:23,JavaParser.java:8193)
    at Foobar.java:11:10 in Statement(Java.ccc:1000:3,JavaParser.java:7610)
    at Foobar.java:11:10 in BlockStatement(Java.ccc:1045:3,JavaParser.java:7804)
    at Foobar.java:11:10 in Block(Java.ccc:1028:50,JavaParser.java:7712)
    at Foobar.java:2:18 in MethodDeclaration(Java.ccc:511:5,JavaParser.java:3354)
    at Foobar.java:2:7 in ClassOrInterfaceBodyDeclaration(Java.ccc:438:3,JavaParser.java:2905)
    at Foobar.java:2:7 in ClassOrInterfaceBody(Java.ccc:429:54,JavaParser.java:2844)
    at Foobar.java:1:21 in ClassDeclaration(Java.ccc:277:3,JavaParser.java:1693)
    at Foobar.java:1:1 in TypeDeclaration(Java.ccc:217:5,JavaParser.java:1524)
    at Foobar.java:1:1 in CompilationUnit(Java.ccc:110:5,JavaParser.java:1116)
    at Foobar.java:1:1 in Root(Java.ccc:34:4,JavaParser.java:420)

This is a result of my ongoing work to generate better stack traces. Actually, I put together a sample code snippet with a bunch of invalid statements here which is:

public class Foobar {
     void foo() {
         foobar()++;
         ++this;
         x = 7++;
         this = 7;
         7++;
         -8;
         x + 3;
        (x()++);
        x?y:z = t;
  }

}

Every single line in the foo method above is invalid, of course. And the Java grammar in CongoCC rejects them as invalid. Well, this is like "dog bites man", not very surprising.

Would you believe that the aforementioned "optimized" Java grammar "parses" the above code with absolutely no complaint?

Apparently, the people behind that, and that includes the famous Terence Parr, decided to take the shortcut that expression ';' is a valid statement in Java. With any expression.
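The rule being skipped there is a small one. The Java Language Specification (section 14.8, Expression Statements) only lets a handful of expression forms stand in front of the ';'. Here is a minimal sketch of that check. The ExprKind enum is a made-up classification of the outermost node of an expression, and this is not a claim about how the assertion in the CongoCC Java grammar is actually written.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names throughout; a sketch of the JLS 14.8 restriction only.
final class StatementExpressionCheck {

    /** Made-up classification of the outermost node of an expression. */
    enum ExprKind {
        ASSIGNMENT, PRE_INCREMENT, PRE_DECREMENT, POST_INCREMENT, POST_DECREMENT,
        METHOD_INVOCATION, CLASS_INSTANCE_CREATION,
        NAME, LITERAL, UNARY, BINARY, TERNARY, PARENTHESIZED
    }

    // The expression forms that JLS 14.8 lists under StatementExpression.
    private static final Set<ExprKind> ALLOWED = EnumSet.of(
        ExprKind.ASSIGNMENT,
        ExprKind.PRE_INCREMENT, ExprKind.PRE_DECREMENT,
        ExprKind.POST_INCREMENT, ExprKind.POST_DECREMENT,
        ExprKind.METHOD_INVOCATION,
        ExprKind.CLASS_INSTANCE_CREATION);

    /** True if an expression of this outermost kind may be followed by ';' as a statement. */
    static boolean isValidStatement(ExprKind outermost) {
        return ALLOWED.contains(outermost);
    }
}
```

Note that this only looks at the outermost form, so it covers cases like x; or -8; or x + 3; or (x()++); but not 7++; or this = 7; since those additionally need a check that the operand (or the left-hand side) is something that can be assigned to.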

What I realized a bit after that was that this combines with the fact that any character above 0x7F is considered to be a valid "letter" in an identifier. So... the aforementioned parser accepts the following as a valid statement:

    ¿?¡: 🚀;

(Really, it does! Try it!)

The first symbol, the upside down question mark, is of course a character common in the Spanish speaking world, and it is a valid identifier (according to this grammar). So is the upside down exclam. And, of course, the regular '?' and ':' retain their meaning. So the above is a ternary expression. If upside-down question-mark (presumably a boolean variable) evaluates to true, then the expression evaluates to ¡ and if not, then it is the character 0x1F680, which is the pictograph for a rocket or spaceship, I guess. Also recognized as a valid Java identifier. Maybe that is the spaceship on which these people arrived from planet Krypton.

Or you could have:

       !⚣ ? 💒 : ⚯;

The double-male symbol ⚣ apparently represents male homosexuality. The church pictograph with the hearts represents marriage, and the final symbol supposedly represents unmarried partnership.

But, you know, the damnedest thing about the above examples is that even if you replace the above symbols with legit identifiers, so you have something like:

     x?y:z;

it's still not a valid statement in Java!

A couple of weeks ago, I opened up a channel of communication with these people. You can see that here. I didn't ask them why they have such a funky definition of what valid Java is. Though, actually, I started the conversation 2 weeks ago and did not realize the full extent of this at that time. I just wanted to know how one of the aforementioned 4 grammars can be at least 30 or 40 times faster than the other ones. At first, when I saw just how permissive the "optimized" grammar was, I thought that was why it was so much faster, but then I tried making the other grammars equivalently loose -- at least on that key issue -- and after rebuilding they remained just as slow as before. It's the most befuddling thing.

Do look at that conversation there. I find it quite extraordinary.

To be maximally fair, it is true that the Java grammar in question (or actually all 4 of them) are not core parts of the ANTLR tool. Our Java grammar really is a core part of the tool. We need the Java parsing capability (and Python and C# for polyglot) and this is a component that is part of our core functionality -- even though it is structured so that somebody could use it on its own. So the comparison could be considered unfair. But still, really...

The issue with the thing not knowing what a valid Java identifier is, that's one thing and quite easy to fix. However, it came to my attention that the "optimized" grammar generates a parser with massive memory requirements. To run over the JDK source code, it needs about 400 megs of heap. The other grammars are even worse. The Java9 one runs out of memory. It needs about 6 gigabytes of heap to parse the JDK source code. (Our parser does fine with 20 megs.) I could not get the Java20 grammar to parse the JDK source code. Even if you feed it one very big file, like any of our generated grammars, it just fails. It runs out of memory, even if you launch it with -Xmx10G or something like that.

The other interesting aspect of these things is that they do not generate human readable error messages. The error it reports is almost invariably the same: "No valid alternative", though it does give the location fairly accurately. And that, of course, is when it does report an error. It is very spotty on that, as I described. Apparently there is a way to get an ANTLR generated parser to emit decent (so they say) error messages, and it involves registering some sort of listener object and...

But anyway, do have a look at my conversation with those people. I still find these sorts of things extraordinary -- though I suppose I shouldn't. Basically, it boils down to the fact that they hate being told anything. They really hate it! Of course, that's normal, I guess, but in my view, it's one thing to hate being told anything. But one should learn to dissimulate a bit better, no?

revusky

Actually, the pictographs did not display even though they did when editing. Maybe because that was a monospaced font used for code and didn't have the glyphs. Let me try with regular proportional font:
!⚣ ? 💒 : ⚯; and ¿?¡: 🚀;

adMartem

Kind of touchy aren't they. For a parser generator that's not suitable for production use (to paraphrase), it certainly has a lot of users expecting it to work. At least that's my impression.

Incidentally, the token definitions for COBOL "letters" look like this:


/*
 *  The following comprises the set of characters allowed in user-defined words.
 *  The characters include the letters, ideographic and syllabic characters, digits,
 *  modifiers, and combining marks recommended for programming language identifiers
 *  in Annex A of ISO/IEC TR 10176:2003. These characters can be used to write
 *  many natural languages of the world.
 *  It also corresponds to the set allowed by the committee draft COBOL standard
 *  ISO/IEC 1989:20xx.  If/when this becomes a standard, this set will be aligned if it
 *  differs from this draft.
 */
< COBOL_STATE, FUNCTION_STATE, SQL_STATE, JAVA_STATE > TOKEN :
  < #LATIN:
      [
        "\u0041"-"\u005A",
        "\u0061"-"\u007A",
        "\u00AA", "\u00BA", "\u00C0"-"\u00D6", "\u00D8"-"\u00F6", "\u00F8"-"\u01BA", "\u01BB", "\u01BC"-"\u01BF",
        "\u01C0"-"\u01C3", "\u01C4"-"\u021F", "\u0222"-"\u0233", "\u0250"-"\u02AD", "\u1E00"-"\u1E9B", "\u1EA0"-"\u1EF9", "\u207F"
      ]
  >
|
  < #GREEK:
      [
        "\u0386", "\u0388"-"\u038A", "\u038C", "\u038E"-"\u03A1", "\u03A3"-"\u03CE", "\u03D0"-"\u03D7", "\u03DA"-"\u03F3", "\u1F00"-"\u1F15", "\u1F18"-
        "\u1F1D", "\u1F20"-"\u1F45", "\u1F48"-"\u1F4D", "\u1F50"-"\u1F57", "\u1F59", "\u1F5B", "\u1F5D", "\u1F5F"-"\u1F7D", "\u1F80"-"\u1FB4", "\u1FB6"-"\u1FBC",
        "\u1FC2"-"\u1FC4", "\u1FC6"-"\u1FCC", "\u1FD0"-"\u1FD3", "\u1FD6"-"\u1FDB", "\u1FE0"-"\u1FEC", "\u1FF2"-"\u1FF4", "\u1FF6"-"\u1FFC"
      ]
  >
|
  < #CYRILLIC:
      [
        "\u0400"-"\u0481", "\u048C"-"\u04C4", "\u04C7"-"\u04C8", "\u04CB"-"\u04CC", "\u04D0"-"\u04F5", "\u04F8"-"\u04F9"
      ]
  >
|
  < #ARMENIAN:
      [
        "\u0531"-"\u0556", "\u0561"-"\u0587"
      ]
  >
|
  < #HEBREW:
      [
        "\u05B0"-"\u05B9", "\u05BB"-"\u05BD", "\u05BF", "\u05C1"-"\u05C2", "\u05D0"-"\u05EA", "\u05F0"-"\u05F2"
      ]
  >
|
  < #ARABIC:
      [
        "\u0621"-"\u063A", "\u0640", "\u0641"-"\u064A", "\u064B"-"\u0652", "\u0670", "\u0671"-"\u06D3", "\u06D5", "\u06D6"-"\u06DC", "\u06E5"-"\u06E6",
        "\u06E7"-"\u06E8", "\u06EA"-"\u06ED", "\u06FA"-"\u06FC"
      ]
  >
|
  < #SYRIAC:
      [
        "\u0710", "\u0711", "\u0712"-"\u072C"
      ]
  >
|
  < #THAANA:
      [
        "\u0780"-"\u07A5", "\u07A6"-"\u07B0"
      ]
  >
|
  < #DEVANAGARI:
      [
        "\u0901"-"\u0902", "\u0903", "\u0905"-"\u0939", "\u093D", "\u093E"-"\u0940", "\u0941"-"\u0948", "\u0949"-"\u094C", "\u094D", "\u0950", "\u0951"-"\u0952",
        "\u0958"-"\u0961", "\u0962"-"\u0963"
      ]
  >
|
  < #BENGALI:
      [
        "\u0981", "\u0982"-"\u0983", "\u0985"-"\u098C", "\u098F"-"\u0990", "\u0993"-"\u09A8", "\u09AA"-"\u09B0", "\u09B2", "\u09B6"-"\u09B9", "\u09BE"-"\u09C0",
        "\u09C1"-"\u09C4", "\u09C7"-"\u09C8", "\u09CB"-"\u09CC", "\u09CD", "\u09DC"-"\u09DD", "\u09DF"-"\u09E1", "\u09E2"-"\u09E3", "\u09F0"-"\u09F1"
      ]
  >
|
  < #GURMUKHI:
      [
        "\u0A02", "\u0A05"-"\u0A0A", "\u0A0F"-"\u0A10", "\u0A13"-"\u0A28", "\u0A2A"-"\u0A30", "\u0A32"-"\u0A33", "\u0A35"-"\u0A36", "\u0A38"-"\u0A39",
        "\u0A3E"-"\u0A40", "\u0A41"-"\u0A42", "\u0A47"-"\u0A48", "\u0A4B"-"\u0A4D", "\u0A59"-"\u0A5C", "\u0A5E", "\u0A72"-"\u0A74"
      ]
  >
|
  < #GUJARATI:
      [
        "\u0A81"-"\u0A82", "\u0A83", "\u0A85"-"\u0A8B", "\u0A8D", "\u0A8F"-"\u0A91", "\u0A93"-"\u0AA8", "\u0AAA"-"\u0AB0", "\u0AB2"-"\u0AB3", "\u0AB5"-
        "\u0AB9", "\u0ABD", "\u0ABE"-"\u0AC0", "\u0AC1"-"\u0AC5", "\u0AC7"-"\u0AC8", "\u0AC9", "\u0ACB"-"\u0ACC", "\u0ACD", "\u0AD0", "\u0AE0"
      ]
  >
|
  < #ORIYA:
      [
        "\u0B01", "\u0B02"-"\u0B03", "\u0B05"-"\u0B0C", "\u0B0F"-"\u0B10", "\u0B13"-"\u0B28", "\u0B2A"-"\u0B30", "\u0B32"-"\u0B33", "\u0B36"-"\u0B39",
        "\u0B3D", "\u0B3E", "\u0B3F", "\u0B40", "\u0B41"-"\u0B43", "\u0B47"-"\u0B48", "\u0B4B"-"\u0B4C", "\u0B4D", "\u0B5C"-"\u0B5D", "\u0B5F"-"\u0B61"
      ]
  >
|
  < #TAMIL:
      [
        "\u0B82", "\u0B83", "\u0B85"-"\u0B8A", "\u0B8E"-"\u0B90", "\u0B92"-"\u0B95", "\u0B99"-"\u0B9A", "\u0B9C", "\u0B9E"-"\u0B9F", "\u0BA3"-"\u0BA4",
        "\u0BA8"-"\u0BAA", "\u0BAE"-"\u0BB5", "\u0BB7"-"\u0BB9", "\u0BBE"-"\u0BBF", "\u0BC0", "\u0BC1"-"\u0BC2", "\u0BC6"-"\u0BC8", "\u0BCA"-"\u0BCC",
        "\u0BCD"
      ]
  >
|
  < #TELUGU:
      [
        "\u0C01"-"\u0C03", "\u0C05"-"\u0C0C", "\u0C0E"-"\u0C10", "\u0C12"-"\u0C28", "\u0C2A"-"\u0C33", "\u0C35"-"\u0C39", "\u0C3E"-"\u0C40", "\u0C41"-
        "\u0C44", "\u0C46"-"\u0C48", "\u0C4A"-"\u0C4D", "\u0C60"-"\u0C61"
      ]
  >
|
  < #KANNADA:
      [
        "\u0C82"-"\u0C83", "\u0C85"-"\u0C8C", "\u0C8E"-"\u0C90", "\u0C92"-"\u0CA8", "\u0CAA"-"\u0CB3", "\u0CB5"-"\u0CB9", "\u0CBE", "\u0CBF", "\u0CC0"-
        "\u0CC4", "\u0CC6", "\u0CC7"-"\u0CC8", "\u0CCA"-"\u0CCB", "\u0CCC"-"\u0CCD", "\u0CDE", "\u0CE0"-"\u0CE1"
      ]
  >
|
  < #MALAYALAM:
      [
        "\u0D02"-"\u0D03", "\u0D05"-"\u0D0C", "\u0D0E"-"\u0D10", "\u0D12"-"\u0D28", "\u0D2A"-"\u0D39", "\u0D3E"-"\u0D40", "\u0D41"-"\u0D43", "\u0D46"-
        "\u0D48", "\u0D4A"-"\u0D4C", "\u0D4D", "\u0D60"-"\u0D61"
      ]
  >
|
  < #SINHALA:
      [
        "\u0D82"-"\u0D83", "\u0D85"-"\u0D96", "\u0D9A"-"\u0DB1", "\u0DB3"-"\u0DBB", "\u0DBD", "\u0DC0"-"\u0DC6", "\u0DCA", "\u0DCF"-"\u0DD1",
        "\u0DD2"-"\u0DD4", "\u0DD6", "\u0DD8"-"\u0DDF", "\u0DF2"-"\u0DF3"
      ]
  >
|
  < #THAI:
      [
        "\u0E01"-"\u0E30", "\u0E31", "\u0E32"-"\u0E33", "\u0E34"-"\u0E3A", "\u0E40"-"\u0E45", "\u0E46", "\u0E47"-"\u0E4E"
      ]
  >
|
  < #LAO:
      [
        "\u0E81"-"\u0E82", "\u0E84", "\u0E87"-"\u0E88", "\u0E8A", "\u0E8D", "\u0E94"-"\u0E97", "\u0E99"-"\u0E9F", "\u0EA1"-"\u0EA3", "\u0EA5", "\u0EA7",
        "\u0EAA"-"\u0EAB", "\u0EAD"-"\u0EB0", "\u0EB1", "\u0EB2"-"\u0EB3", "\u0EB4"-"\u0EB9", "\u0EBB"-"\u0EBC", "\u0EBD", "\u0EC0"-"\u0EC4", "\u0EC6",
        "\u0EC8"-"\u0ECD", "\u0EDC"-"\u0EDD"
      ]
  >
|
  < #TIBETAN:
      [
        "\u0F00", "\u0F18"-"\u0F19", "\u0F35", "\u0F37", "\u0F39", "\u0F40"-"\u0F47", "\u0F49"-"\u0F6A", "\u0F71"-"\u0F7E", "\u0F7F", "\u0F80"-"\u0F84", "\u0F86"-
        "\u0F87", "\u0F88"-"\u0F8B", "\u0F90"-"\u0F97", "\u0F99"-"\u0FBC"
      ]
  >
|
  < #MYANMAR:
      [
        "\u1000"-"\u1021", "\u1023"-"\u1027", "\u1029"-"\u102A", "\u102C", "\u102D"-"\u1030", "\u1031", "\u1032", "\u1036"-"\u1037", "\u1038", "\u1039", "\u1050"-
        "\u1055", "\u1056"-"\u1057", "\u1058"-"\u1059"
      ]
  >
|
  < #GEORGIAN:
      [
        "\u10A0"-"\u10C5", "\u10D0"-"\u10F6"
      ]
  >
|
  < #ETHIOPIC:
      [
        "\u1200"-"\u1206", "\u1208"-"\u1246", "\u1248", "\u124A"-"\u124D", "\u1250"-"\u1256", "\u1258", "\u125A"-"\u125D", "\u1260"-"\u1286", "\u1288", "\u128A"-
        "\u128D", "\u1290"-"\u12AE", "\u12B0", "\u12B2"-"\u12B5", "\u12B8"-"\u12BE", "\u12C0", "\u12C2"-"\u12C5", "\u12C8"-"\u12CE", "\u12D0"-"\u12D6",
        "\u12D8"-"\u12EE", "\u12F0"-"\u130E", "\u1310", "\u1312"-"\u1315", "\u1318"-"\u131E", "\u1320"-"\u1346", "\u1348"-"\u135A"
      ]
  >
|
  < #CHEROKEE:
      [
        "\u13A0"-"\u13F4"
      ]
  >
|
  < #SYLLABICS:
      [
        "\u1401"-"\u166C", "\u166F"-"\u1676"
      ]
  >
|
  < #OGHAM:
      [
        "\u1681"-"\u169A"
      ]
  >
|
  < #RUNIC:
      [
        "\u16A0"-"\u16EA", "\u16EE"-"\u16F0"
      ]
  >
|
  < #KHMER:
      [
        "\u1780"-"\u17B3", "\u17B4"-"\u17B6", "\u17B7"-"\u17BD", "\u17BE"-"\u17C5", "\u17C6", "\u17C7"-"\u17C8", "\u17C9"-"\u17D3"
      ]
  >
|
  < #MONGOLIAN:
      [
        "\u1820"-"\u1842", "\u1843", "\u1844"-"\u1877", "\u1880"-"\u18A8", "\u18A9"
      ]
  >
|
  < #HIRAGANA:
      [
        "\u3041"-"\u3094"
      ]
  >
|
  < #KATAKANA:
      [
        "\u30A1"-"\u30FA", "\u30FB", "\u30FC"
      ]
  >
|
  < #BOPOMOFO:
      [
        "\u3105"-"\u312C", "\u31A0"-"\u31B7"
      ]
  >
|
  < #UNIFIED_IDEOGRAPHS:
      [
        "\u3400"-"\u4DB5", "\u4E00"-"\u9FA5", "\uFA0E"-"\uFA0F", "\uFA11", "\uFA13"-"\uFA14", "\uFA1F", "\uFA21", "\uFA23"-"\uFA24", "\uFA27"-"\uFA29"
      ]
  >
|
  < #YI:
      [
        "\uA000"-"\uA48C"
      ]
  >
|
  < #HANGUL:
      [
        "\uAC00"-"\uD7A3"
      ]
  >
|
  < #DIGIT:
      [
        "\u0030"-"\u0039", "\u0660"-"\u0669", "\u06F0"-"\u06F9", "\u0966"-"\u096F", "\u09E6"-"\u09EF", "\u0A66"-"\u0A6F", "\u0AE6"-"\u0AEF", "\u0B66"-"\u0B6F",
        "\u0BE7"-"\u0BEF", "\u0C66"-"\u0C6F", "\u0CE6"-"\u0CEF", "\u0D66"-"\u0D6F", "\u0E50"-"\u0E59", "\u0ED0"-"\u0ED9", "\u0F20"-"\u0F29", "\u1040"-
        "\u1049", "\u1369"-"\u1371", "\u17E0"-"\u17E9", "\u1810"-"\u1819"
      ]
  >
|
  < #SPECIAL_LETTERS:
      [
        "\u00B5", "\u02B0"-"\u02B8", "\u02BB"-"\u02C1", "\u02D0"-"\u02D1", "\u02E0"-"\u02E4", "\u02EE", "\u037A", "\u0559", "\u1FBE", "\u203F"-"\u2040",
        "\u2102", "\u2107", "\u210A"-"\u2113", "\u2115", "\u2119"-"\u211D", "\u2124", "\u2126", "\u2128", "\u212A"-"\u212D", "\u212F"-"\u2131", "\u2133"-"\u2134",
        "\u2135"-"\u2138", "\u2139", "\u2160"-"\u2183", "\u3005", "\u3006", "\u3007", "\u3021"-"\u3029", "\u3038"-"\u303A"
      ]
  >
|
  < #ADDITIONAL_CHARS:
      [
        "$",
        "\u005F", // (low line)
        "\u00B7"  // (middle dot)
      ]
  >
;

< COBOL_STATE, FUNCTION_STATE, SQL_STATE, JAVA_STATE > TOKEN :
  < #LETTER:
    <LATIN> | 
    <GREEK> |
    <CYRILLIC> |
    <ARMENIAN> |
    <HEBREW> |
    <ARABIC> |
    <SYRIAC> |
    <THAANA> |
    <DEVANAGARI> |
    <BENGALI> |
    <GURMUKHI> |
    <GUJARATI> |
    <ORIYA> |
    <TAMIL> |
    <TELUGU> |
    <KANNADA> |
    <MALAYALAM> |
    <SINHALA> |
    <THAI> |
    <LAO> |
    <TIBETAN> |
    <MYANMAR> |
    <GEORGIAN> |
    <ETHIOPIC> |
    <CHEROKEE> |
    <SYLLABICS> |
    <OGHAM> |
    <RUNIC> |
    <KHMER> |
    <MONGOLIAN> |
    <HIRAGANA> |
    <KATAKANA> |
    <BOPOMOFO> |
    <UNIFIED_IDEOGRAPHS > |
    <YI> |
    <HANGUL> |
    <SPECIAL_LETTERS> |
    <ADDITIONAL_CHARS>
  >
;

revusky

Kind of touchy aren't they.

Well, yeah. My own sense of this is that there are these people who really hate being told anything. Of course, the biggest problem with that is that, if you're like that, just about nobody will ever try to tell you anything -- after all, if trying to tell somebody something tends to get that unpleasant, then...

By the way, there has been another iteration or so of my conversation with those people. See here

For a parser generator that's not suitable for production use (to paraphrase), it certainly has a lot of users expecting it to work. At least that's my impression.

Well, yeah, they're certainly pushing this ANTLR thing as a mature tool that somebody could really want to use... The whole situation is very strange to me, but it's also kind of déjà vu. Back when I got so involved with FreeMarker (over 20 years ago at this point) our main competitor was this thing from Apache Software Foundation called "Velocity". My God that thing was inferior! And, yeah, I found it utterly exasperating that we were ostensibly competing with that kind of dreck. But it's like there's this set of approved things that are credited with being of a certain quality and so on... And it's a crazy situation. The whole thing attracts its coterie of fanboys and...

Of course, here, we're mostly talking about the Java grammar(s) that they have in their repository, and the resulting Java parser. But when you just see how sloppy they are on something like that, it just doesn't inspire any confidence. I mean, for one thing, if they don't have a decent grammar out-of-the-box for Java, then what could you expect of the grammars for much more obscure languages? I mean, if they have grammars for Oberon or.... Ada... (which they do) and those are kinda lame (which I'm not sure of but I have my suspicions) that's one thing. But, you know, Java is such an important language that you'd think they'd want to offer something serious and professional as a testament to what ANTLR is capable of. (Shrug.)

adMartem Incidentally, the token definitions for COBOL "letters" look like this:

Well, what is appalling about that is that it really is pretty easy to get stuff like that right. The level of sloppiness and unseriousness of just saying that any character beyond 0x7F can be a "letter" in an identifier...
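Just to show how little it takes on the Java side: the JDK already ships the relevant character tables, via Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. A minimal sketch follows. The class and method names (IdentifierCheck, isJavaIdentifierShape) are made up, and this is not meant as a description of how the CongoCC Java grammar defines its identifier token. It also only checks the characters, so keywords and the literals true, false and null would still pass.

```java
// Hypothetical helper; only the two java.lang.Character methods are real JDK API.
final class IdentifierCheck {

    /**
     * True if s is non-empty and every code point is allowed at its position
     * according to the JDK's own identifier tables. Works on code points,
     * so characters outside the BMP (such as U+1F680) are handled as one unit.
     */
    static boolean isJavaIdentifierShape(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this, an ordinary name like x or a accented letter like é is accepted, while ¿, ¡, ⚣ and the rocket pictograph are all rejected, since they are punctuation or symbol characters and not letters.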

I was playing around with this stuff recently and trying to get things more correct. Certainly, the Java grammar in CongoCC has not been perfect. Here is a funny one. Consider the expression 2.x(), let's say. Well, obviously, it is invalid, because you can't dereference a numerical literal. Our parser was rejecting that, but I could not see why! And I was puzzled. It got to the point where I fired up the debugger and was stepping through it and finally, I realized the issue!

You see, 2. is a floating point literal in Java. I would have thought that the zero after the . would be mandatory and so I was thinking 2.x was tokenizing as NUMERICAL_LITERAL DOT IDENTIFIER. But since 2. is a floating point literal, it was scanning as NUMERICAL_LITERAL IDENTIFIER which obviously cannot be parsed. In fact, the Java parser from the ANTLR community refused to parse it. It gave the typical informative message of "no valid alternative". But once I realized that 2. was a numerical literal, I realized that the expression to check for was 2..x because the first dot just terminates the number literal and then the next dot would be for dereferencing the LHS. And then, I saw that both parsers, ours and the ANTLR people's, parsed things like 2..x with no complaints. Except now ours doesn't. The current version (of OUR parser) now tells you:

Encountered an error at Foo.java:7:11
Assertion at: Java.ccc:891:4 failed. A numerical literal cannot be derereferenced.
    at Foo.java:7:11 in PrimaryExpression(Java.ccc:891:4,JavaParser.java:6364)
    at Foo.java:7:9 in StatementExpression(Java.ccc:1121:3,JavaParser.java:8020)
    at Foo.java:7:9 in ExpressionStatement(Java.ccc:1170:23,JavaParser.java:8156)
    at Foo.java:7:9 in Statement(Java.ccc:1046:3,JavaParser.java:7652)
    at Foo.java:7:9 in BlockStatement(Java.ccc:1089:3,JavaParser.java:7846)
    at Foo.java:7:9 in Block(Java.ccc:1072:50,JavaParser.java:7754)
    at Foo.java:6:17 in MethodDeclaration(Java.ccc:497:5,JavaParser.java:3229)
    at Foo.java:6:5 in ClassOrInterfaceBodyDeclaration(Java.ccc:424:3,JavaParser.java:2780)
    at Foo.java:6:5 in ClassOrInterfaceBody(Java.ccc:415:54,JavaParser.java:2719)
    at Foo.java:5:19 in ClassDeclaration(Java.ccc:279:3,JavaParser.java:1599)
    at Foo.java:5:1 in TypeDeclaration(Java.ccc:221:5,JavaParser.java:1430)
    at Foo.java:5:1 in CompilationUnit(Java.ccc:107:5,JavaParser.java:1006)
    at Foo.java:1:1 in Root(Java.ccc:34:4,JavaParser.java:410)
Parse failed on: Foo.java

So it gives you pretty exhaustive info about how you got there!
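The tokenization behind the 2.x versus 2..x business above can be reproduced with a toy scanner. This is only a sketch under heavy simplifying assumptions: the number shape ignores exponents, suffixes, hex and underscores, and the token names NUMBER, IDENT and DOT are stand-ins (regex group names in Java cannot contain underscores), not the names used in Java.ccc or in the ANTLR grammars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy scanner for illustration only; not how CongoCC or ANTLR lex Java.
final class NumberDotDemo {

    // A number may end in a bare '.', as in "2." (digits, dot, optional digits).
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NUMBER>\\d+\\.\\d*|\\.\\d+|\\d+)"
        + "|(?<IDENT>[A-Za-z_$][A-Za-z0-9_$]*)"
        + "|(?<DOT>\\.)");

    /** Splits the input into tokens rendered as KIND(text), scanning left to right. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            String kind = m.group("NUMBER") != null ? "NUMBER"
                        : m.group("IDENT") != null ? "IDENT"
                        : "DOT";
            tokens.add(kind + "(" + m.group() + ")");
            pos = m.end();
        }
        return tokens;
    }
}
```

Here tokenize("2.x") gives [NUMBER(2.), IDENT(x)], with no DOT token at all, which is why a parser cannot make sense of it. And tokenize("2..x") gives [NUMBER(2.), DOT(.), IDENT(x)], which looks like a perfectly ordinary dereference unless the grammar explicitly rules out a numerical literal on the left-hand side.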

Improving the error messages like this has been fairly low-hanging fruit really, but it also made me realize something, which I think I'll detail in a new thread.

adMartem

I just realized ANTLR might be described as an Attractive Nuisance To Lure Rookies.