A performance data point

adMartem

revusky I agree it seems pretty reasonable. I'm not inclined to worry about performance at this point anyway, since I think it is more important to get the entire process through tree building working correctly for all 12,000 or so regression tests, which is still a work in progress. I will certainly let you know of anything I might find in this area, of course.

adMartem

revusky The approach you outline is essentially what we do for COBOL variable initialization in our generated code. Basically, Java generated for COBOL has to avoid the static initializer limit, the constant pool limit, and the normal method byte-code limit. I'll try some changes to the template to do as you suggest when I think I have the entire grammar working in non-fault-tolerant mode (not quite there yet). At that point I can also send you the grammar if I still have problems. Thanks.
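
For readers unfamiliar with the limits mentioned above: the JVM caps the bytecode of any single method (including a static initializer) at 64 KB, so generated code typically splits a large initialization into several small methods. The following is only a minimal sketch of that general idea, not the actual CongoCC template or the COBOL code generator's output; the class name `ChunkedInit`, the `TABLE` array, and the `init0`/`init1` helpers are all hypothetical.

```java
// Hypothetical illustration: a generator would emit one small initN()
// method per chunk so that no single method nears the 64 KB limit.
class ChunkedInit {
    static final int[] TABLE = new int[6];

    // The static initializer only dispatches to the chunk methods,
    // so it stays tiny no matter how large the table grows.
    static {
        init0();
        init1();
    }

    private static void init0() {
        TABLE[0] = 10;
        TABLE[1] = 11;
        TABLE[2] = 12;
    }

    private static void init1() {
        TABLE[3] = 13;
        TABLE[4] = 14;
        TABLE[5] = 15;
    }
}
```

In real generated code each chunk would hold hundreds or thousands of assignments, with the chunk size chosen by the generator.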

adMartem

Good news on the performance front! In the process of getting the grammar to work with more and more test cases, I ran into one COBOL program that seemed to parse 50x slower with CongoCC. Well, I couldn't ignore it any longer (if it were really irreconcilably that slow, I would have to rethink the whole effort). So I dusted off a copy of JProfiler that was kindly given to me 5 years ago and set out to track down the answer. It turns out the answer is, "no, I just did some stupid things in the grammar." Actually, it is more like, "no, the stupid things I had to do in JavaCC are no longer necessary in CongoCC."

One of these was a (now) TOKEN HOOK that maintained a hash map of case-flattened token images. Every token fetched was checked against the map to see if it was "unreserved" by the user and, if so, converted to a COBOL_WORD token before pressing on. Now the parser just creates an EnumSet of the "unreserved" tokens at initialization time (by running each "unreserved" string through the lexer), which reduces the TOKEN HOOK to seeing if the current token's type is contained in the set and, if so, doing its thing. This change sped up the test parse by 10x.
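
The EnumSet idea described above might look roughly like the sketch below. It is not the actual grammar code: the `TokenType` enum, the `lex` stand-in for "running a string through the lexer", and the `hook` method are all hypothetical names chosen for illustration. The point is only that the per-token check becomes a set membership test on the token type, with no string case-folding or hashing.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class UnreservedWords {
    // Hypothetical token types; a real COBOL grammar has hundreds.
    enum TokenType { ACCEPT, DISPLAY, MOVE, STATUS, COBOL_WORD }

    // Hypothetical stand-in for the lexer: maps an image to its token type,
    // falling back to COBOL_WORD for anything that is not a keyword.
    static TokenType lex(String image) {
        try {
            return TokenType.valueOf(image.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return TokenType.COBOL_WORD;
        }
    }

    private final Set<TokenType> unreserved = EnumSet.noneOf(TokenType.class);

    // Done once, at parser initialization time.
    UnreservedWords(List<String> userUnreserved) {
        for (String word : userUnreserved) {
            TokenType type = lex(word);
            if (type != TokenType.COBOL_WORD) {
                unreserved.add(type);
            }
        }
    }

    // Done for every token: retype "unreserved" keywords as COBOL_WORD.
    TokenType hook(TokenType type) {
        return unreserved.contains(type) ? TokenType.COBOL_WORD : type;
    }
}
```

With `new UnreservedWords(List.of("status"))`, `hook(TokenType.STATUS)` yields `COBOL_WORD` while `hook(TokenType.MOVE)` is returned unchanged.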

Another stupid thing was that the chaotic nature of LOOKAHEAD in JavaCC had ended up causing me to sprinkle unnecessary (now that nesting works) LOOKAHEADs in places that essentially resulted in the entire program being scanned multiple times. Fixing this in just a couple of places (there are more, I am sure) sped up the parse by another 4x.
So now the troublesome program is about 35% slower than JavaCC 😃
BTW, the lexer state machine fraction of the total parse time in this case is around 6%.

revusky

adMartem

Well, this all sounds pretty good. Frankly, it's hard for me to get excited about a report that the Congo generated parser is 35% slower, say. I mean the original JavaCC is such a primitive, simplistic tool really. That the generated code might run a bit faster would hardly even be surprising. OTOH, a 50x difference is really pretty unacceptable. And that, actually, is what the speed difference with ANTLR typically is. It really seems that it's something like that. But, anyway, to use the legacy JavaCC because the code it generates runs 30, 40, or 50% faster would just be a terrible trade-off, I'm pretty sure.

Today I was playing around with original JavaCC, just because I wanted to see what code it generated for various cases. And honestly, I had forgotten what a bloody-minded simplistic tool it really is. I think people credit it as being something much more sophisticated than it really is because the whole parser generation space is enveloped in this sort of obfuscated jargon.

I guess what does make the parser generation space kind of challenging is just that the tool is a code generator, so when you hit a bug, you're pretty much always one extra degree of separation away from the problem, as compared to when you just write code directly by hand. So a bug manifests itself in code that was generated and to find the bug, you have to trace back to the place where the code was generated, not the generated code itself. The problem with the original JavaCC project is that it always eschewed the use of templates. When the code is generated with a series of println calls, trying to find any bug is like... When you're generating from a template, the template still kind of resembles the output. It's still challenging, but you get used to working with the templates. So, you know, you see things like: https://github.com/javacc21/javacc21/blob/master/src/ftl/java/ParserProductions.java.ftl or https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl which are really the most nitty-gritty templates that generate the parser/lookahead code.

But, I mean, just how much more clearly you can express certain things when you have things like up-to-here notation and assertions and then lexical state and token activation/deactivation that actually works in conjunction with lookahead. Oh, and contextual predicates...

Of course, the problem has been that the whole thing is less solid than I thought, because some of these features do interact in screwy ways at the moment, though I've been gradually beating it into shape. I think the last issue you brought up is fixed. In my defense, I would point out that probably if one just restricted oneself to using the features that already existed in legacy JavaCC, the tool is pretty solid. In that original feature set, there are probably fewer bugs in Congo than in the original JavaCC. The bugs we're hitting (and that I am in the process of squashing...) relate to features that simply never existed in the original JavaCC. (And never will!)

adMartem

revusky
I certainly agree that anything performance-degradation-wise better than a binary order of magnitude is worth the cost in this application given the fragility of maintaining and/or developing a large original javacc-based grammar. It's especially clear for a grammar to parse a language like COBOL that has had essentially no single point of origin and has undergone evolutionary changes over a period of 60 years.

adMartem

As I have continued to profile and knock down hotspots in the grammar by making better use of up-to-here, reordering some choice expansions that were accommodating issues with the original javacc lookahead, and exposing it to a wider variety of large COBOL programs, I've observed that when both parsers are fully warmed up CongoCC takes from 1 to 3 times as long to parse. In most cases it is between 1 and 1.5, but when the program consists of very large data divisions (where data declarations occur in COBOL) and relatively small procedure divisions (where the procedural code is) the bulk of the time seems to be in the lexer (70+%). This is anecdotal at the moment, but it looks like this will be the salient difference. Unfortunately, real-world COBOL programs often have many (e.g., thousands of) included "copy books", each of which sometimes defines thousands of variables. It is not unusual for the parser to be dealing with effective source sizes of 100,000+ lines consisting of 95% data declarations.
Of course these declarations can produce many tokens, and that is probably what is going on.
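
A "fully warmed up" comparison like the one above can be approximated with a small harness along the following lines. This is only a sketch under stated assumptions, not the measurement setup actually used: the class `WarmTiming` and its methods are hypothetical, the two parsers are represented as plain `Runnable` tasks, and a serious benchmark would more likely use a dedicated tool such as JMH.

```java
import java.util.Arrays;

class WarmTiming {
    // Runs the task `warmups` times untimed (to let the JIT settle),
    // then returns the median wall-clock time of `runs` timed executions.
    static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    // Ratio of candidate time to baseline time; 1.5 means "1.5 times as long".
    static double ratio(Runnable candidate, Runnable baseline, int warmups, int runs) {
        long c = medianNanos(candidate, warmups, runs);
        long b = Math.max(1L, medianNanos(baseline, warmups, runs));
        return (double) c / b;
    }
}
```

Here `candidate` and `baseline` would each wrap a parse of the same source file by one of the two generated parsers.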

vsajip

adMartem Thank you for all the feedback you are giving - it is really useful to see the tool come up against real-world use cases, and it can only improve as a result of this feedback! Indeed, that has already happened.
