A question if I may. I am trying to construct a rule that only allows an expansion if a certain non-terminal does NOT occur following it. An example is the following:

MoveCatena :
    ( Identifier =>|| | Literal ) <TO> IdentifierList
    | ( <CORRESPONDING> | <CORR> ) CorrespondingIdentifier <TO>
    ( SCAN CorrespondingIdentifier [<COMMACHAR>] AssignmentOperator => FAIL | CorrespondingIdentifier [<COMMACHAR>] )+
;

I've tried several variants that I thought might work, but no luck so far. Essentially what I want is for the lookahead to FAIL if the next-in-line CorrespondingIdentifier is followed by an AssignmentOperator (which would mean it does not belong to this MOVE statement, but rather to a subsequent assignment statement to be parsed next).

Am I on the wrong track? I was hoping to do this in a clean way, rather than the rather ugly way I do it in JavaCC (with a semantic predicate that manually scans forward over the identifier to detect a following assignment). Conceptually, what I am trying to do is something equivalent to a "~" in the middle of a scan specification, like SCAN Foo ~Bar => Foo

    adMartem

    adMartem SCAN Foo ~Bar => Foo

    I think something like this:

     Foo ASSERT ~(Bar) =>||

    ought to work....

      adMartem

      Did you resolve your "code too large" problem with the XXX_FIRST_SET generation?

      I tried reducing the number to 1 (which would seem to always use the init() method) and still got the "code too large" error. I was planning to do a little more research before replying, specifically to verify my recollection that the init() method call is generated in the class's static initialization method. If so, that is probably what is exceeding 64K. In my earlier experimental approach (a single FirstSet enum with a value for each first set), I was hoping the FirstSet enum's initialization would be generated separately, but I concluded that all of an enum's values (i.e., the constructor calls, in this case) are likewise performed in a class static initializer, thus not solving the problem. I believe that general approach would work if each FirstSet value could be expressed as a single entry in the class's constant pool. For example, what is currently

        static private final EnumSet<TokenType> first_set$p3cobol_ccc$7195$9 = tokenTypeSet(ALL, CHARACTERS, FIRST, LAST, LEADING, TRAILING);

      (for which I tried p3cobol_cc$7195$9(ALL, CHARACTERS, FIRST, LAST, LEADING, TRAILING) as the corresponding FirstSet value; FirstSet has something like 1,000 values in all) would have to become something like p3cobol_cc$7195$9("ALL/CHARACTERS/FIRST/LAST/LEADING/TRAILING"), with the FirstSet constructor splitting the string and actually forming the EnumSet inside the constructor. Then, in the code, a first_set$p3cobol_ccc$7195$9 reference would simply become FirstSet.p3cobol_ccc$7195$9.
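
      Just to make that concrete, here is a rough sketch of the idea (my illustration, not actual generated code; TokenType here just stands in for the generated token-type enum, using the member names from the example above):

        import java.util.EnumSet;

        // Stand-in for the generated token-type enum (member names taken from the example above).
        enum TokenType { ALL, CHARACTERS, FIRST, LAST, LEADING, TRAILING }

        // Each FirstSet constant carries its members as a single string literal (one
        // constant-pool entry); the EnumSet is built inside the enum constructor rather
        // than in a field initializer of the parser class.
        enum FirstSet {
            // ... the real generated code would have roughly 1,000 of these constants ...
            p3cobol_ccc$7195$9("ALL/CHARACTERS/FIRST/LAST/LEADING/TRAILING");

            final EnumSet<TokenType> tokens;

            FirstSet(String names) {
                tokens = EnumSet.noneOf(TokenType.class);
                for (String name : names.split("/")) {
                    tokens.add(TokenType.valueOf(name));
                }
            }
        }
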
      Anyway, bottom line is the problem is not yet solved 🙁

        6 days later

        revusky Regarding this, I made a change to the lexer template to do the inner NFA loop twice each time it's called, so I could find the total percentage of time spent there (the old two-simultaneous-linear-equations trick). It turns out that, for the current COBOL grammar, about 20 to 30% of the total parsing time is spent in that do ... while loop (after the parser is warmed up, and across a variety of reasonably sized COBOL sources).
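        (In other words: if a normal run costs T1 = overhead + loop and the doubled-loop run costs T2 = overhead + 2 × loop, then loop = T2 - T1, and the loop's share of a normal run is (T2 - T1) / T1.)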

          adMartem

          Well, "code too large" is really a pretty silly, trivial problem at root. As you surely understand, the problem is that a single Java method cannot compile into more than 64K bytecodes. And the way this is usually a problem is in terms of static initialization code. So if you have:

          static private EnumSet<TokenType> Foo_FIRST_SET = EnumSet.of(A, B, C,...);
          static private EnumSet<TokenType> Bar_FIRST_SET = EnumSet.of(D, E, F, ....);
           .... 1000 more lines like this maybe...

          All of these static initializations end up being compiled into a single static initialization method, and that method can easily end up being over 64K in bytecode.

          But that is hardly such a big problem really. The general solution to the above is to break the initialization up into multiple methods, none of which exceeds the 64K limit. So, for the above, you would need to generate something more like:

          static private EnumSet<TokenType> Foo_FIRST_SET, Bar_FIRST_SET, Baz_FIRST_SET, ...;

          So, you define the fields and then, assuming that initializing them in a single method hits the "code too large" issue, you need to have your initialization in multiple methods, like:

           static void static_init1() {
                 Foo_FIRST_SET = EnumSet.of(A, B, C, ...);
                 Bar_FIRST_SET = EnumSet.of(D, E, F, ...);
                 // and so on...
           }

          And then:

           static void static_init2() {
                 // ... the next group of initializations ...
           }

          ...


           static void static_init10() {
                 // ... the final group of initializations ...
           }

          And then elsewhere:

           static {
               static_init1();
               static_init2();
               // ...
               static_init10();
           }

          Well, in the above, there are 10 static initialization methods, but it could be something else. It's probably not too hard to do some rough calculations as to how many of them you actually do need and then generate them. So it's a question of mucking with the template to generate code more like the above. So, if you are comfortable editing FreeMarker templates, you could well have a go at it.
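
          Just to illustrate the arithmetic with made-up numbers: if each EnumSet.of(...) assignment compiles to somewhere in the range of 50-100 bytes of bytecode, then staying comfortably under the 65535-byte method limit means putting at most a few hundred of them in each of the static_init methods, so a thousand or so FIRST_SET fields would only need a handful of such methods.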

          Or alternatively, I guess you could send me your grammar by private email and I'll see if I can get it working.

          I really thought I had nailed the various "code too large" scenarios and then you came along and were still running into this. Well, I guess I declared victory prematurely, kind of like George Bush on that aircraft carrier. But regardless, we will have our clear victory over this. (Unlike George Bush...)

          But, you know, the whole thing is such a testament to how stagnant some of these projects are. Generating code that doesn't compile because of "code too large" is a major PITA in both legacy JavaCC and ANTLR. (Though actually I just googled a bit, and it may be that this was a big problem in ANTLR3 and they refactored the code generation to avoid it in ANTLR4. Not sure...) But regardless, it is just amazing, because it's a trivial problem and the ostensible project maintainers of JavaCC have never tamed this issue in 20 years or so!

          Actually, their modus operandi is that any problem like this is just a "known issue" somehow and it is just accepted that they're never going to address it.

          Oh well, as a veteran JavaCC user, you know all that, I guess! LOL.

            adMartem current COBOL grammar about 20 to 30% of the total parsing time is spent in that

            Well, that sounds pretty reasonable to me then. I guess if you really did want to do some performance optimization, there would be a better chance of finding some gains elsewhere. The truth of the matter is that there is probably a fair bit of low-hanging fruit in terms of tightening up the generated parser code, as I've never made any serious attempt to do much on that front. It's been entirely about having it work correctly. In fact, if you look at all closely at the generated parser code, you'll probably note that it's really pretty straightforward and bloody-minded. There's really been no attempt to optimize it at all. I've never done any profiling to find bottlenecks or any of that.

            I don't think it's the right place to apply our energies now anyway. That said, if you do have some straightforward improvements on that front, just clear optimizations that don't come at much cost in terms of readability/maintainability, then sure, by all means...

              revusky I agree it seems pretty reasonable. I'm not inclined to worry about performance at this point anyway, since I think it is more important to get the entire process, through tree building, working correctly for all 12,000 or so regression tests, something that is a work in progress. I will certainly let you know of anything I might find in this area, of course.

              revusky The approach you outline is essentially what we do for COBOL variable initialization in our generated code. Basically, the Java generated for COBOL has to avoid the static-initializer limit, the constant-pool limit, and the normal per-method bytecode limit. I'll try some changes to the template to do as you suggest when I think I have the entire grammar working in non-fault-tolerant mode (not quite there yet). At that point I can also send you the grammar if I still have problems. Thanks.

              25 days later

              Good news on the performance front! In the process of getting the grammar to work with more and more test cases, I ran into one COBOL program that seemed to parse 50x slower with CongoCC. Well, I couldn't ignore it any longer (if it really were that slow, with no way around it, I would have to rethink the whole effort). So I dusted off a copy of JProfiler that was kindly given to me 5 years ago and set out to track down the answer. It turns out the answer is, "no, I just did some stupid things in the grammar." Actually, it is more like, "no, the stupid things I had to do in JavaCC are no longer necessary in CongoCC."

              One of these was a (now) TOKEN HOOK that maintained a hash map of case-flattened token images, checked against every token fetched, to see whether the token had been "unreserved" by the user and, if so, convert it to a COBOL_WORD token and press on. Now it just creates an EnumSet of the "unreserved" token types at parser initialization time (by running each "unreserved" string through the lexer), reducing the TOKEN HOOK to checking whether the current token is contained in the set and, if so, doing its thing. This change sped up the test parse by 10x.
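
              In outline, the new arrangement amounts to something like this (just a sketch; the class and method names are invented for illustration, and TokenType stands in for the generated token-type enum):

                import java.util.EnumSet;

                // Stand-in for the generated token-type enum.
                enum TokenType { COBOL_WORD, MOVE, TO /* , ... */ }

                class UnreservedWords {
                    // Built once, at parser initialization, by running each user-supplied
                    // "unreserved" word through the lexer and recording the token type it produces.
                    private final EnumSet<TokenType> unreserved = EnumSet.noneOf(TokenType.class);

                    void unreserve(TokenType lexedType) {
                        unreserved.add(lexedType);
                    }

                    // The per-token hook is now a single EnumSet membership test instead of a
                    // hash-map lookup on a case-flattened token image for every fetched token.
                    TokenType adjust(TokenType fetchedType) {
                        return unreserved.contains(fetchedType) ? TokenType.COBOL_WORD : fetchedType;
                    }
                }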

              Another stupid thing was that the chaotic nature of LOOKAHEAD in JavaCC had ended up causing me to sprinkle unnecessary (now that nesting works) LOOKAHEADs in places that essentially resulted in the entire program being scanned multiple times. Fixing this in just a couple of places (there are more, I am sure) sped up the parse by another 4x.
              So now the troublesome program is about 35% slower than JavaCC 😃
              BTW, the lexer state machine fraction of the total parse time in this case is around 6%.

                adMartem

                Well, this all sounds pretty good. Frankly, it's hard for me to get too worked up about a report that the Congo-generated parser is, say, 35% slower. I mean, the original JavaCC is such a primitive, simplistic tool really. That the generated code might run a bit faster would hardly even be surprising. OTOH, a 50x difference is really pretty unacceptable. And that, actually, is what the speed difference with ANTLR typically is. It really seems that it's something like that. But, anyway, to use the legacy JavaCC because the code it generates runs 30, 40, or 50% faster would just be a terrible trade-off, I'm pretty sure.

                Today I was playing around with the original JavaCC, just because I wanted to see what code it generated for various cases. And honestly, I had forgotten what a bloody-minded, simplistic tool it really is. I think people credit it as being something much more sophisticated than it really is because the whole parser generation space is enveloped in this sort of obfuscated jargon.

                I guess what does make the parser generation space kind of challenging is just that the tool is a code generator, so when you hit a bug, you're pretty much always one extra degree of separation away from the problem, compared to when you just write code directly by hand. A bug manifests itself in the generated code, and to find it you have to trace back to where that code was generated, not just look at the code itself. The problem with the original JavaCC project is that it always eschewed the use of templates. When the code is generated with a series of println calls, trying to find any bug is like... When you're generating from a template, the template still kind of resembles the output. It's still challenging, but you get used to working with the templates. So, you know, you see things like https://github.com/javacc21/javacc21/blob/master/src/ftl/java/ParserProductions.java.ftl or https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl, which are really the most nitty-gritty templates that generate the parser/lookahead code.

                But, I mean, it's just so much clearer to express certain things when you have up-to-here notation and assertions, and then lexical states and token activation/deactivation that actually work in conjunction with lookahead. Oh, and contextual predicates...

                Of course, the problem has been that the whole thing is less solid than I thought, because some of these features do interact in screwy ways at the moment, though I've been gradually beating it into shape. I think the last issue you brought up is fixed. In my defense, I would point out that probably if one just restricted oneself to using the features that already existed in legacy JavaCC, the tool is pretty solid. In that original feature set, there are probably fewer bugs in Congo than in the original JavaCC. The bugs we're hitting (and that I am in the process of squashing...) relate to features that simply never existed in the original JavaCC. (And never will!)

                  revusky
                  I certainly agree that any performance degradation smaller than a binary order of magnitude is worth the cost in this application, given the fragility of maintaining and/or developing a large grammar based on the original JavaCC. That is especially clear for a grammar that parses a language like COBOL, which has had essentially no single point of origin and has undergone evolutionary changes over a period of 60 years.

                  As I have continued to profile and knock down hotspots in the grammar, by making better use of up-to-here, reordering some choice expansions that were accommodating issues with the original JavaCC lookahead, and exposing it to a wider variety of large COBOL programs, I've observed that, when both parsers are fully warmed up, CongoCC takes from 1 to 3 times as long to parse. In most cases it is between 1 and 1.5, but when the program consists of very large data divisions (where data declarations occur in COBOL) and relatively small procedure divisions (where the procedural code is), the bulk of the time seems to be in the lexer (70+%). This is anecdotal at the moment, but it looks like this will be the salient difference. Unfortunately, real-world COBOL programs often have many (e.g., thousands of) included "copy books", each of which sometimes defines thousands of variables. It is not unusual for the parser to be dealing with effective source sizes of 100,000+ lines consisting of 95% data declarations.
                  Of course these declarations can produce a great many tokens, and that is probably what is going on.

                    adMartem Thank you for all the feedback you are giving - it is really useful to see the tool come up against real-world use cases, and it can only improve as a result of this feedback! Indeed, that has already happened.