A performance data point

adMartem

First, a little background. I currently have a COBOL -> Java compiler/runtime for which I use javacc/jtb for parsing and building the AST. A sub-grammar of the main COBOL grammar is one used for encoding the COBOL "picture" strings. For that I also used javacc/jtb and encoded the picture by visiting the AST in a single pass.

I discovered CongoCC (ne. Javacc21) about a month ago, and became very excited, since every time I have to touch the COBOL grammar I am reminded of how fragile and unpredictable javacc is. The problem is that I use jtb, not jjtree (not a show stopper, just some tedious work to change the multiple passes over the AST, or hack CongoCC to create the optional nodes like jtb) and I rely on some of javacc's bugs to accomplish some things in the current parser. However, I decided to recode the picture parser using CongoCC to get my feet wet, so to speak. Now, I have the result. It is not exactly apples-to-apples. My new CongoCC picture parser (picture encoder in my parlance) does not generate a tree (although it did along the way to its final form). Instead the semantics are all rolled into the grammar and performed during parsing, but because of the simplicity of the former visitor logic, I suspect it is not a very significant difference.

The CongoCC-based picture encoder is about 2x faster than the javacc/jtb one! Although in the actual compiler the picture encodings are cached, so the performance of the encoder is not so important, I was still happy with the results as an indication that I can remain excited about possibly replacing javacc/jtb entirely.

The main COBOL grammar is very large (around 12,000 lines), so there is some work to do, but I am now highly motivated. And CongoCC is now entrenched in the product as the new picture encoder!

So goodbye javacc, and thanks for all the fish.

Oh, and here is the grammar if anyone is interested https://drive.google.com/file/d/1Va5z4mAiWh1EQ9epGqB7suWivO2umor7/view?usp=sharing

revusky

Hi,

It's quite gratifying to hear that this is working so well for you. It's interesting what you say about performance being 2x better with CongoCC (at least for your case) because the truth of the matter is that there has been very little work on performance issues or profiling generally. As for jtb, I have to admit that I am not really familiar with it at all. I recall one person saying that he preferred jtb to jjtree but I never really looked into it. As far as I can see, both JTB and JJTree were created around the same time, around 1997, but JJTree had a big advantage because it was part of the base JavaCC package, not a separate download from somewhere else, so I assume it is used much more widely. I don't think either one was really developed very much afterwards, like in the last quarter century. So they generate code that doesn't use generics or anything in modern Java, I mean anything added to Java in the 21st century! (That actually was the idea behind calling this improved version JavaCC 21!)

I looked at the grammar you linked and I think you tend to write SCAN in a lot of spots where it is superfluous. Typically, if you're using what I call the "up-to-here" marker, you don't need to write SCAN.

So, in the ancient days, you would write:

    LOOKAHEAD(Foo() Bar()) Foo() Bar() Baz()

But then later, in JavaCC 21, you could write:

    SCAN Foo Bar => Foo Bar Baz

which is nicer, and one can still write that, but the repetition bothered me and I figured one could write:

     Foo Bar =>|| Baz

which means that we scan (or look ahead) for the Foo Bar part and if we find that, we parse the whole Foo Bar Baz. But there is no need to write SCAN in front of that.

There are also some things where there are maybe too many ways of writing something. At some point, I realized that this pattern where you write:

      LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz()

is common. I mean, the expansion you scan ahead for is the exact same one that you then parse. So I figured you could just write:

        LOOKAHEAD Foo Bar Baz

and it would mean the same thing. However, I later figured it could be written more tersely as:

       => Foo Bar Baz

But that was before I came with the "up-to-here" notation so the above can also be written:

     Foo Bar Baz =>||

And I don't know if there is so much value in having so many different ways of writing the same thing.

I am considering whether, with the transition to the CongoCC naming, it will only allow the last way above of writing this. But, if I do that, I'll have an automatic syntax conversion utility that converts existing grammars to the approved way of writing it.

As for the 12000 line COBOL grammar you mention, I guess you could get a fair bit of value out of just having the INCLUDE directive so you can break it up into smaller files.

Well, anyway, if you have any suggestions or anything to report, don't hesitate!

adMartem

Thanks for the tip. I somehow missed the fact that the "SCAN" was superfluous when using the "=>||" up-to-here.

Yes, JTB came out at about the same time and, I observe, is equally wedded to the "nothing burger" approach of its sibling. I began using it about 10 years ago, and had to fix several problems with it before it worked for me, even though it had been around for years. When I went to check a couple of years ago to see if they had fixed my problems, I found that they had done nothing significant except make the code difficult to merge with my changes. I merged it anyway and found a host of new problems which I had to fix. I then decided I would never attempt to update my version with theirs again. BTW, the major goal of their work seemed to be adding fields in nodes for the try/catch/finally structures in the grammar that captured the text of the catch and finally part. I have no idea why that is useful.

Regarding what JTB is compared to JJtree, I'm not sure either since I didn't use it. But based on what you have in CongoCC, for my use case the difference is that JTB creates fields in each non-terminal Node for every anonymous child expansion and token using the types NodeToken, NodeOptional, NodeList, NodeListOptional, and NodeSequence (for a non-top-level expansion in parentheses). It also creates fields for every non-terminal. Of course it also puts other things in the Node to support visitors, but that is essentially homomorphic with CondoCC Nodes and can be made visitor-compatible via simple injection.

I'm still pondering how to deal with the fields since I would like to do an incremental transition to a CongoCC grammar without changing all of the visitors initially (there a half a dozen visitors and the largest one is 22K lines (ugh!). At first glance it seems like adding another layer of Node creation for the JTB types would not be that disruptive to the architecture of CongoCC. (But right now I'm focused on just getting the grammar parsing the same COBOL correctly.) The target, however, would be to provide named expansions for everything the visitors use that currently uses the JTB anonymous types, which would remove all traces of JavaCC/JTBism and be orders of magnitude more robust and transparent.

Currently I have just gotten to the point that I get a clean compilation for the COBOL parser, and I just ran into large tree in the road. See my subsequent post for that...

Thanks again for all you have done; I really think it is massive progress for the (active) Java parsing community.

revusky

adMartem Yes, JTB came out at about the same time and, I observe, is equally wedded to the "nothing burger" approach of its sibling.

Well, one striking thing, in particular about JJTree, is that the approach the tool takes is that of being an add-on that treats the core JavaCC as a "black box". It's essentially a sort of pre-processor to JavaCC grammars, allowing some extra annotations relating to tree-building, and then generating a regular, unannotated JavaCC grammar with the generated code actions that correspond to the tree-building in the annotations. My strong sense of things had always been that JJTree was not written by the same team that developed JavaCC. I actually asked Sriram Sankar about this at some point somewhat over a year ago and he confirmed to me that JJTree was written by another guy, Rob Duncan, who was not part of the original JavaCC team. And that goes a long way to explaining why the JJTree has this very half-baked feel to it, like it was sort of bolted onto the original JavaCC. Perhaps the biggest single oddity in JJTree is that Token does not implement Node. Surely the most natural way of implementing an AST would be that the Token objects are the terminal nodes in the tree. But you can't do that because a Token is not a Node.

There is also the very strange fact that you can inject code into the generated XXXParser or XXXTokenManager but not in a generated ASTXXX.java Node subclass. That alone makes the tool borderline unusable IMHO. Because when you want to add some fields or methods to a generated XXXNode.java you have to post-edit the generated file. So now you can't really do a clean rebuild, which would be to just wipe out all the generated .java files and rebuild, because if you did that, you would end up deleting the extra methods that you added by hand!

So, I guess the bottom line is that the thing has these very bizarre impedance mismatch sorts of problems that are, properly understood, a result of the tool being written by a separate person who did not have any ability (or confidence?) to modify how anything in the core JavaCC tool worked.

Now, as for JTB, I have to think that came into existence in the very early days of JavaCC. I suppose that, for the first year approximately of JavaCC's existence, it had no automatic tree-building ability, not even the bolted-on JJTree thing. So some grad students (I think) at UCLA who were using JavaCC extensively wrote the JTB thing because (quite understandably!) they got tired of rewriting basically the same tree-building actions every time they worked up a new JavaCC grammar! So they wrote JTB to automate that. So I infer that JTB developed some following out there but probably would never have existed if JavaCC had had the JJTree thing from the start. But the funny thing is that all this is circa 1997 when JavaCC was not open source. It was free to download, but the source was not available. Until mid-2003. So it is perfectly understandable that whoever wrote JTB treated the core JavaCC tool as a total black box. However, that JJTree takes that approach is rather strange because, presumably, whoever was working on JJTree had access to the JavaCC source code, and could tweak it to support what they were doing on JJTree. But I guess the person didn't dare touch the core JavaCC code. And, actually, if you ever looked at the legacy JavaCC codebase, you can see why they didn't dare touch it!

But, you know, as regards the nothingburger concept, I think that when people in a project are terrified of touching the core code, that is really the biggest symptom of nothing-burgerism. I think so, but in any case, I daresay that just about anybody who has kicked around in the world of software development knows what "nothingburgerism" is, even if they never used that term.

I was thinking that another way of looking at this, maybe from the other way around, is that any active, living, breathing software project is, to a very large extent, engaged in a continual battle against entropy. There is this tendency for the codebase to become full of ad-hoc patches to address whatever particular needs, and the thing can gradually become this entangled mess, something totally intractable. So there is a need to do some regular refactorings and cleanups to prevent the codebase from becoming something totally intractable. So, another way of characterizing a "nothingburger" it's that it's a project that has completely given up on the aforementioned battle against entropy. They ran up the white flag long ago, basically. Though, it's not just that, because sometimes people just drift away from a project and don't do much with it, and that's normal. A full-blown "nothingburger" like the legacy JavaCC/JJTree is a project that is dead (because it has given up totally the battle against entropy) but there are people utterly devoted to the (bizarre) goal of pretending that the thing is an active development effort. And that really is something excruciating to observe -- the empty motions that these people go through, month after month, year after year, eventually decade after decade with this JavaCC thing, trying to present a dead project as something that is real and active.

I honestly don't know why people do this.

adMartem

Regarding the try/catch/finally construct in JavaCC, my COBOL uses it all over the place to try and resync the parsing after a syntax error. Things like this

{}
{
    try {
      
        IdentificationDivision()
        [ EnvironmentDivision() ]
        [ DataDivision() ]
        [ ProcedureDivision() ]

    } catch (ParseException pe) {
        failAndSkipToBefore(pe, ErrorCode.BAD_PROGRAM,
            Tokens.sequence(IDENTIFICATION,DIVISION),
            Tokens.sequence(ID,DIVISION),
            ENVIRONMENT_DIVISION,
            DATA_DIVISION,
            PROCEDURE_DIVISION,         
            EOF);
    }
    
}

Which will become (in CongoCC) something like

ProgramUnit :
    IdentificationDivision
    [ EnvironmentDivision ]!
    [ DataDivision ]!
    [ ProcedureDivision ]!
;

I think.
Woohoo!

revusky

adMartem Regarding the try/catch/finally construct in JavaCC

I don't know whether you came across this blog article in which I made no bones about my skepticism regarding the usefulness of the try-catch construct in legacy JavaCC. I mean, given the fact that parsers generated by legacy JavaCC include basically zero disposition for backtracking or recovery...

Though, to be clear, you can still use this construct in CongoCC, I never removed it, but my doubts about its usefulness remain. Well, to be clear, I wouldn't say either that it is totally impossible to recover and get back on the rails somehow, at least in some cases maybe, but it would really be some kind of black art! And also, it is quite possible that you can do something or other that is better than nothing -- which is kind of what I'm guessing is your case in the above. So, I mean, I'm not saying that you can't do anything in a catch block, but let's face it. It's bound to be something quite crude! (Which again, may well be better than nothing, but...)

But that blog article, which I see was written 2 years ago (how time flies!) I mentioned that I had implemented an alternative to try-catch, which is ATTEMPT/RECOVER. The code is still there for that, but it is basically untested. You see, normally, when I implement a new feature, I try to use it internally pretty quickly. So, the up-to-here stuff I was talking about, for example, that's used so extensively internally that, by now, it is as solid as any other core feature really. But that is not the case with the various fault-tolerant/error-recovery sort of stuff. I was gratified that 'ngx' seemed to be putting the fault-tolerant stuff to good use, but I do need to make 100% clear that this sort of stuff is nowhere near as tested as other things.

With the ATTEMPT/RECOVER I describe in that blog post from 2 years ago, I implemented it and I must have done a bit of checking to see that it seemed to work, but I never got any feedback from anybody about it one way or the other and it's not being used internally, so I can't even affirm that it still works! Possibly, due to code drift, it doesn't even work! It seems that the fault-tolerant stuff described here does basically work, though I mostly know that from the recent posts by ngx.

Well, I think the approach described is basically valid and it does work (mostly) but I have to make clear that if you use it, you should be doing so on the understanding that it's not as solid/tested as other things and really, you should be trying to help me get the kinks out, reporting any problems you run into.

But all that said, fault-tolerant is where it's at. So I anticipate that, once we get the CongoCC renaming fully done, probably fault-tolerant will be on by default. Unless you explicitly turn it off. (And besides, how do you get anything tested anyway, as a practical matter... people out there do have to start using it!)

Now, as for the ATTEMPT/RECOVER, the idea there is well motivated, I think. Well, the basic difference is that with ATTEMPT/RECOVER, the parser machinery (if you hit an error in the ATTEMPT block) rewinds to the state it was in just before the ATTEMPT block. So I thought this would be more generally useful! But, again, I'm not even 100% sure that ATTEMPT/RECOVER still works! But if it doesn't and you have a clear use-case, we'll beat it into shape.

So that's about where things stand...

adMartem

Yes, I did see the blog post regarding this, and that was one of the things that caused me to see if I could move from Javacc/JTB to CongoCC. As you say, the use of TCF to recover (somewhat) from parse exceptions was barely usable with great effort and tedium. All it really can do is eat tokens from wherever it is to some kind of "landmark" where parsing can attempt to continue. In COBOL, that works in a lot of cases, because the language is fairly rigid regarding the sequence of constructs, and places massive importance on the "period-separator", essentially a ". " token, which is the terminator of "sentences" which happens to be a good sync point. Unfortunately it cannot be 100% relied on, so you have to sync on sometimes dozens of other tokens that might let you recover. All together a bad way to do it.

So, my goal would be to get rid of the TCF stuff in the grammar and use the ATTEMPT/RECOVER instead. I'm happy to consider it experimental and subject to change.

For the JTB stuff, I was hoping to cobble together enough to allow the parser to generated JTB compatible fields in the AST nodes in order to use the existing visitors to make sure there weren't any performance showstoppers, and then throw out the JTB stuff and revise the visitors to use the #<nonterminal-name> JJTree notation to create named nodes instead of using the JTB generated fields like nodeOptional, nodeChoice, etc. The source-compatible visitation itself is easily taken care of by INJECTing an accept method to Node like this:

default public void accept(Visitor visitor) {
        visitor.recurse(this);
    }

But for now, I am getting the grammar itself ready to compile the 12,000 or so COBOL regression tests and make sure it seems to work with my revisions (like "=>||"). I'll try the ATTEMPT/RECOVER capability also and let you know how that goes.

adMartem

Well, I'm finally back on this after having to take a month on a more boring project. Several items to report.
I tried enabling "fault_tolerant" and use ATTEMPT/RECOVER and "!" re-sync but that causes the actual "Code too large" problem to occur. Apparently the method calls to form the additional sets of tokens for that push it over the byte-code limit for the parser class (parser is over 200K lines, incidentally). I tried some quick mods to use an enum for each set of tokens (i.e., FirstSetEnum) with each value comprising the EnumSet of tokens, but that had the same problem because (of course) the enum's static initialization also is part of the class static initialization. At that point I decided I would just let you know about the problem ,since anything further I would do would suffer from my naivety. In solving a similar problem with Java generated by the COBOL compiler I had to resort to using a single large string to initialize what would otherwise be separate static initializers. That would have been my next step, namely, initializing the enums' values with a single string that would be decoded in the constructor to populate the enum sets.

Regarding anecdotal performance so far, the full COBOL parser seems to be between 2 and 5 times slower than the JavaCC one when warmed up. I'm not super worried about that at this time, however, because 1) I haven't spent any effort to optimize the productions for CongoCC and 2) the parse time for COBOL (grammar + tree building) is only a small fraction of the time Java takes to compile the code I generate. Overall compiler performance is, however, important to our users since they often have a single COBOL subprogram that generates several hundred thousand lines of Java. In these cases we can explain away the Java time as "sorry, you wanted Java and it has to be compiled" but not the preprocessing and Java generation time. So eventually I'm sure I will need to pay attention to this.

So right now I am pressing on with attempting to get the CongoCC parser to parse everything the JavaCC parser can (I have an experimental system with a parsing phase that uses both grammars and then goes on the use the JavaCC built tree for later passes). So the lack of fault tolerant stuff right is not blocking anything except my curiosity to see how much more scrutable and robust it would make my JavaCC-based quasi-"fault tolerance".

BTW, my biggest problem is that the Java debugger can't step to any line # >65535! Makes debugging very difficult when the class is 150K lines or so. I have to manually rearrange the code so that the parts I think I will need to step into are below the limit.

revusky

adMartem Regarding anecdotal performance so far, the full COBOL parser seems to be between 2 and 5 times slower than the JavaCC one when warmed up.

Now, regarding this, I'd be very happy to try to get to the bottom of it, but I guess we have to narrow it down between two main possibilities. As you surely understand, there are a couple of separate machines operating, principally tokenization and parsing/tree-building. We need to know if it's the tokenization that is significantly slower or the actually parsing part.

Offhand, it seems like your COBOL grammar has a VERY large lexical grammar (I mean in terms of the number of distinct NFA states generated) and it could be that this code (I mean the scanning code the tool generates) gets quite a bit slower when we have that many NFA states.

If that is not where the slowdown is, then my best guess would be that it has to do with syntactic lookaheads. There are cases where legacy JavaCC has a huge speed advantage over JavaCC 21 (not exactly a legit speed difference really...) simply because legacy JavaCC ignores nested syntactic lookahead. So if you do syntactic lookahead over highly nested structures where the inner constructs also use syntactic lookahead, you can find that JavaCC21 is much slower, but the speed difference is actually kinda spurious, because the legacy tool is achieving the higher speed by simply not doing any of the nested syntactic lookaheads! But that might or might not be your problem, of course. I think the above two are the main possibilities anyway.

Oh, and by the way, I feel I should also make the point that the code generation of this tool is not optimized for performance at all. It has never been a priority because, well, for one thing, there is a general sense that, pragmatically speaking, there is not a really serious performance problem with the code it generates. And certainly, almost all effort has been focused on correctness basically.

So, if there comes the time to actually focus on performance of generated code, there probably is a fair bit of low-hanging fruit in terms of improving things. I mean, that's just never been a focus of our efforts. So far...

revusky

Hi, I appreciate the feedback!

First of all, I think you can probably (even almost certainly) get rid of the code too large problem wrt FIRST_SET generation by tweaking the following line: https://github.com/javacc21/javacc21/blob/master/src/ftl/java/CommonUtils.java.ftl#L49

Basically, just try numbers lower than 8. In fact, if you just deleted that #elseif block completely even, that would surely do it too. The 8 there is quite arbitrary in any case. I just figured that the "code too large" would not be an issue if you generated a separate XXXinit method if there were 8 or more token types in the first set. But if you just generate a separate _init() method for every last first set, then it would surely be okay. (I think so, anyway....)

Well, let me know...

adMartem

revusky I will try this tonight.

adMartem

revusky Regarding this, I made a change to the lexer template to do the inner NFA loop twice each time it's called so I could find the total % time spent there (the old two simultaneous linear equation trick). It turns out that for the current COBOL grammar about 20 to 30% of the total parsing time is spent in that do ... while (after the parser is warmed up and across a variety of reasonably sized COBOL sources).

adMartem

A question if I may. I am trying to construct a rule that only allows an expansion if a certain non-terminal does NOT occur following it. An example is the following:

MoveCatena :
    ( Identifier =>|| | Literal ) <TO> IdentifierList
    | ( <CORRESPONDING> | <CORR> ) CorrespondingIdentifier <TO>
    ( SCAN CorrespondingIdentifier [<COMMACHAR>] AssignmentOperator => FAIL | CorrespondingIdentifier [<COMMACHAR>] )+
;

I've tried several variants that I thought might work, but no luck so far. Essentially what I want is for the lookahead to FAIL if the next-in-line CorrespondingIdentifier is followed by an AssignmentOperator (which would mean it does not belong to this MOVE statement, but rather to a subsequent assignment statement to be parsed next).

Am I on the wrong track? I was hoping to do this in a clean way, rather than the rather ugly way I do it in JavaCC (with a semantic predicate that manually scans forward over the identifier to detect a following assignment). Conceptually something equivalent to a "~" in the middle of a scan specification is what I am trying to do. Like SCAN Foo ~Bar => Foo

revusky

adMartem

adMartem SCAN Foo _Bar => Foo

I think something like this:

 Foo ASSERT ~(Bar) =>||

ought to work....

adMartem

revusky Thanks!

adMartem

revusky Worked perfectly!

revusky

adMartem

Did you resolve your "code too large" problem with the XXX_FIRST_SET generation?

adMartem

I tried reducing the number to 1 (which would seem to always use the init() method) and still got the "code to large". I was planning to do a little more research before replying, specifically verifying my recollection that the init() method call would be generated in the class static initialization method. If so, that is probably exceeding 64K. In my earlier experimental approach (with a single enum for FirstSet with values for each enum set) I was hoping it would generate the FirstSet enum's initialization privately, but I concluded all enums' values (i.e., the constructor calls in this case) are performed in the class's static initializer too, thus not solving the problem. I believe that general approach would work if each FirstSet value could be expressed as a single entry in the class literal table.. For example,
what is currently static private final EnumSet<TokenType> first_set$p3cobol_ccc$7195$9= tokenTypeSet(ALL, CHARACTERS, FIRST, LAST, LEADING, TRAILING); where I tried p3cobol_cc$7195$9 (ALL, CHARACTERS, FIRST, LAST, LEADING, TRAILING) as the corresponding FirstSet value (FirstSet has something like 1,000 values in all) would have to be something like p3cobol_cc$7195$9 ("ALL/CHARACTERS/FIRST/LAST/LEADING/TRAILING") and the FirstSet constructor would need to split the string and actually form the EnumSet inside the constructor. Then in the code a first_set$p3cobol_ccc$7195$9 reference would simply become First_set.p3cobol_ccc$7195$9
Anyway, bottom line is the problem is not yet solved 🙁

revusky

adMartem

Well, "code too large" is really a pretty silly, trivial problem at root. As you surely understand, the problem is that a single Java method cannot compile into more than 64K bytecodes. And the way this is usually a problem is in terms of static initialization code. So if you have:

static private EnumSet<TokenType> Foo_FIRST_SET = EnumSet.of(A, B, C,...);
static private EnumSet<TokenType> Bar_FIRST_SET = EnumSet.of(D, E, F, ....);
 .... 1000 more lines like this maybe...

All of these static initializations end up being compiled into a single static initialization method and that method, can easily end up being over 64K in bytecode.

But that is hardly such a big problem really. The general solution to the above is to break the initialization up into multiple methods none of which pass the 64K limit. So, for the above, you would need to generate something more like:

static private EnumSet<TokenType> Foo_FIRST_SET, Bar_FIRST_SET, BAZ_FIRST_SET, etc. etc. ;

So, you define the fields and then, assuming that initializing them in a single method hits the "code too large" issue, you need to have your initialization in multiple methods, like:

 static void  static_init1() {
       Foo_FIRST_SET = EnumSet.of(A,B,C,....);
       Bar_FIRST_SET = EnumSet.of(D,E,F,....);
       and so on...
 }

And then:

 static void static_init2() { 
       the next group of initializations
 }

...

 static void static_init10() {
       the final group of initializations
  }

And then elsewhere:

 static {
     static_init1();
     static_init2();
     ....
     static_init10();
  }

Well, in the above, there are 10 static initialization methods, but it could be something else. It's probably not too hard to do some rough calculations as to how many of them you actually do need and then generate them. So it's a question of mucking with the template to generate code more like the above. So, if you are comfortable editing FreeMarker templates, you could well have a go at it.

Or alternatively, I guess you could send me your grammar by private email and I'll see if I can get it working.

I really thought I had nailed the various "code too large" scenarios and then you came along and were still running into this. Well, I guess I declared victory prematurely, kind of like George Bush on that aircraft carrier. But regardless, we will have our clear victory over this. (Unlike George Bush...)

But, you know, the whole thing is such a testament to how stagnant some of these projects are. Generating code that doesn't compile because of "code too large" is major PITA in both legacy JavaCC and ANTLR. (Though actually I just googled a bit and it may be that this was a big problem in ANTLR3 and they refactored the code generation to avoid it in ANTLR4. Not sure...) But regardless, it is just amazing because it's a trivial problem and the ostensible project maintainers of JavaCC have never tamed this issue in 20 years or so!

Actually, their modus operandi is that any problem like this is just a "known issue" somehow and it is just accepted that they're never going to address it.

Oh well, as a veteran JavaCC user, you know all that, I guess! LOL.

revusky

adMartem current COBOL grammar about 20 to 30% of the total parsing time is spent in that

Well, that sounds pretty reasonable to me then. I guess if you really did want to do some performance optimization, there would be a better chance of finding some gains elsewhere. The truth of the matter is that there is probably a fair bit of low-hanging fruit in terms of tightening up the generated parser code as I've never made any serious attempt to do much on that front. It's been entirely about having it work correctly. In fact, if you look at all closely at the generated parser code, you probably note that it's really pretty straightforward and bloody-minded. There's really been no attempt to optimize it at all. I've never done any profiling to find bottlenecks or any of that.

But I don't think it's the right place to apply our energies now anyway, but that said, if you do have some straightforward improvements on that front, just clear optimization that doesn't come at much cost in terms of readability/maintainability, then sure, by all means...

adMartem

revusky The approach you outline is essentially what we do for COBOL variable initialization in our generated code. Basically, java generated for COBOL has to avoid the static initializer limit, constant pool limit, and the normal method byte-code limit. I'll try some changes to the template to do as you suggest when I think I have the entire grammar working in non-fault-tolerant mode (not quite there yet). At that point can also send you the grammar if I still have problems. Thanks.

adMartem

revusky I agree it seems pretty reasonable. I'm not inclined to worry about performance at this point anyway, since I think it is more important to get the entire process through tree building working correctly for all 12,000 or so regression tests, something that is a work-in-progress. I will certainly let you know of anything I might mind in this area, of course.