Well... "The Dreaded “Code too large” Problem is a Thing of the Past"

adMartem

Just when I thought it was safe, I got my COBOL grammar converted and noticed a single error in this generated file:
https://drive.google.com/file/d/1eShTXtJw1bn04lg3UHTN9QEcM8RvIqGh/view?usp=sharing

The error is "Too many constants, the constant pool for CobolNfaData would exceed 65536 entries".

In my experience, everything about COBOL ends up creating this kind of problem in Java (at least it does with the Java we generate). In particular, typical programs seem to use hundreds of thousands of unique constants, so we have had to create our own literal space, and they have single nested if statements that can go on for thousands of lines of COBOL code. But those problems I've solved, and that's another story.

Now it has attacked my parser. Any suggestions?

revusky

adMartem

Well, first of all, as you probably realize already, the "Code too large" problem that is described here is due to the fact that a single method cannot be more than 64K in bytecode. And that problem really is solved. What you are hitting is a different (though similar enough) problem, the size limit (also 64K) in the constants pool. I was not aware of this issue because I had never run into it. In fact, it appears that I was never even close to running into it.

But yeah, it appears that if your lexical grammar generates too many NFA states, then you hit this problem. And "too many" seems to be somewhere around 8000. So I guess each NFA state, the way this is coded anyway, ends up adding about 8 entries to the constants pool.

I would say that this seems to be a pretty rare problem. (Though, not to worry, we'll get it fixed!) The biggest lexical grammars that I have lying around are the ones for FreeMarker and C#. The Java lexical grammar is a bit smaller and the JavaCC lexical grammar is basically the same as the one for Java, except that there are a few extra things, so it's slightly larger. Well, the largest single lexical grammar I have, FreeMarker (but only by a bit), generates 1187 NFA states. The next largest one, the C# lexical grammar, generates 947 NFA states. So with a limit of around 8000, these very big grammars (by most standards) are still nowhere near hitting the limit.

adMartem Here it is probably the number of methods, so maybe some kind of multi-level dispatching to anonymous classes would be a solution (to get to a logarithmic reduction in constants in a single pool).

Yeah, that must be the solution, I think. If we have greater than some threshold number of NFA states, we just put the NFA_XXX methods in some inner classes and we should be okay, since each inner class would have the 64K limit for the constants pool.

One thing I am wondering about in the back of my mind is whether having this large a number of NFA states causes a significant performance degradation in lexing. Well, that question can be examined once we get this working.

adMartem

Here's the constants file needed to compile the NFA data file:
https://drive.google.com/file/d/1n66-_PQ8b84z5Wk3h-hbaYtMFPDi4yzc/view?usp=sharing

adMartem

Rough estimate, looking at the class file for the picture parser as guidance, is that there are about 8 constant table entries per LexicalState/NFA + misc real constants like numbers. In the case of the full COBOL lexer above, there are about 5000 NFAs for each of two lexer states and another 2000 or so for smaller states, making about 12000*8+(about 5000+ ints)+misc for the number of constant pool slots needed. That is about 50% more than the maximum available. If actual numeric literals were the problem, tricks like forming integer literals arithmetically at runtime would work, but that doesn't seem to be the case. Here it is probably the number of methods, so maybe some kind of multi-level dispatching to anonymous classes would be a solution (to get to a logarithmic reduction in constants in a single pool).
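The arithmetic behind that estimate can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the figure of 8 entries per NFA state is the rough guess from this thread, not a measured value.

```java
// Back-of-the-envelope estimate; the per-state figure is a rough guess, not a measurement.
final class ConstantPoolEstimate {
    // The limit quoted in the compiler error message.
    static final int POOL_LIMIT = 65_536;

    // Estimated pool entries: a fixed cost per NFA state plus other constants (ints etc.).
    static int estimateEntries(int nfaStates, int entriesPerState, int extraConstants) {
        return nfaStates * entriesPerState + extraConstants;
    }

    public static void main(String[] args) {
        int needed = estimateEntries(12_000, 8, 5_000);
        System.out.println("estimated entries needed: " + needed);
        System.out.println("limit:                    " + POOL_LIMIT);
        System.out.println("over the limit:           " + (needed > POOL_LIMIT));
        // At roughly 8 entries per state, this many states fit into one pool.
        System.out.println("states per pool:          " + POOL_LIMIT / 8);
    }
}
```

With these inputs the estimate comes to 12000*8 + 5000 = 101000 entries against a limit of 65536, which is consistent with "about 50% more than the maximum available", and 65536/8 gives the threshold of roughly 8000 states mentioned earlier.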

revusky

adMartem

Okay, well, I tweaked the NfaData.java.ftl template so that this should be fixed.

Basically, I just have it generate a separate inner class for each lexical state and that inner class holds the various NFA_XXXX stuff for that lexical state. So, if you pick up the latest version, either via git pull followed by ant jar, or just grab the latest binary build at https://javacc.com/download/javacc-full.jar it should work, but do report back.
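For readers who have not looked at the generated code, here is a minimal sketch of the general idea in plain Java. It is not what the NfaData template actually emits: every name below (NfaDataSketch, DefaultStateData, the two-state enum, the use of IntPredicate) is hypothetical. The only point it illustrates is that each nested class compiles to its own class file and therefore gets its own constant pool.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical shape only; the real generated code looks different.
final class NfaDataSketch {
    enum LexicalState { DEFAULT, DECIMAL_COMMA }

    // Each nested class is compiled to its own .class file, so the methods and
    // method references below count against this class's constant pool only.
    static final class DefaultStateData {
        static boolean NFA_0(int ch) { return ch == '.'; }
        static boolean NFA_1(int ch) { return ch >= '0' && ch <= '9'; }

        static IntPredicate[] init() {
            return new IntPredicate[] { DefaultStateData::NFA_0, DefaultStateData::NFA_1 };
        }
    }

    static final class DecimalCommaStateData {
        static boolean NFA_0(int ch) { return ch == ','; }
        static boolean NFA_1(int ch) { return ch >= '0' && ch <= '9'; }

        static IntPredicate[] init() {
            return new IntPredicate[] { DecimalCommaStateData::NFA_0, DecimalCommaStateData::NFA_1 };
        }
    }

    // The outer class only holds one table reference per lexical state.
    private static final Map<LexicalState, IntPredicate[]> FUNCTIONS =
            new EnumMap<>(LexicalState.class);

    static {
        FUNCTIONS.put(LexicalState.DEFAULT, DefaultStateData.init());
        FUNCTIONS.put(LexicalState.DECIMAL_COMMA, DecimalCommaStateData.init());
    }

    static IntPredicate[] functionsFor(LexicalState state) {
        return FUNCTIONS.get(state);
    }
}
```

A lexer would then ask for `NfaDataSketch.functionsFor(state)` and work with that state's table, while the thousands of per-state methods stay out of the outer class's pool.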

Now, as things stand, you could still hit the problem if you had >8000 NFA states in a single lexical state. Your largest lexical states generate about 5000 NFA states, so you're still some way away from that. Aside from your COBOL grammar, I don't know of any grammar that has a single lexical state that generates as many as 1000 NFA states, so....

Actually, if your lexical grammar is such that you have >8000 NFA states for a single lexical state, I think it would also hit "Code too large" because the XXX_init() method would be too big. (Though it could be broken into multiple pieces, but I never bothered to even check for that, because I tended to think it never happened!)
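Breaking an oversized init method into pieces could look roughly like the following. This is a hypothetical sketch, not code from the generator: the class name, the NEXT_STATE table and its values are invented, and a real generator would pick the chunk size so that each piece stays safely under the 64K bytecode limit per method.

```java
// Hypothetical sketch: one oversized initializer split into fixed-size helpers.
final class ChunkedInitSketch {
    static final int[] NEXT_STATE = new int[6];

    // The entry point stays tiny; each helper is a separate method
    // and so has its own 64K bytecode budget.
    static void init() {
        initPart0();
        initPart1();
    }

    private static void initPart0() {
        NEXT_STATE[0] = 3;
        NEXT_STATE[1] = 5;
        NEXT_STATE[2] = 1;
    }

    private static void initPart1() {
        NEXT_STATE[3] = 4;
        NEXT_STATE[4] = 0;
        NEXT_STATE[5] = 2;
    }
}
```

Note that this only addresses method size. All the pieces still live in the same class, so it does nothing for the constant pool limit.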

If anybody still hits code too large after this tweak, well, it can still be addressed. It would just be a question of breaking the inner class that represents the problematic lexical state into multiple pieces. But I just did the simplest tweak that would work for now. Not that it is obligatory or anything, but you might find it interesting to see how small the diff is that addresses this: https://github.com/javacc21/javacc21/commit/017967763b0740e34dcfc8b72ca32f5ef755bf8f

(Or I think it addresses this. You tell me!)

adMartem

Thanks! I will try this in the next day or so and report back. I do something similar in the COBOL code generator when a single paragraph (corresponds to a method in generated Java) would likely exceed the 64K byte limit. In that case, I have seen almost no negative performance impact. Hopefully you will see the same in this case.
As you probably noticed, there are two lexical states with around 5000 NFA states. The reason for that is that COBOL has a feature triggered by the clause DECIMAL-POINT IS COMMA, which causes the "," to become the decimal point indicator and "." to become equivalent to "," in printed or edited numbers. But everywhere else the "." retains its normal meaning, usually as a "sentence" terminator. Hence two lexical states for most of the language. It could be done more selectively, but there would still be at least one lexical state with around 5000 NFA states, I suspect.
And, yes I used some poetic license with my title.

adMartem

That seems to have done the trick. Thanks for the quick response. Now I can work on replacing a global LOOKAHEAD=3 by selectively augmenting LOOKAHEADs now that they can be nested!

revusky

adMartem global LOOKAHEAD=3

Yeah, well, actually, JavaCC 21 (and therefore CongoCC) does not support setting a global lookahead other than 1. At some point (and I can't remember exactly when) I just removed that option. So, the default lookahead amount is now always 1.

In general, though, I think that numerical lookahead is actually a very screwy, error-prone sort of idea to start with. (Leaving aside a globally set lookahead amount that is other than 1, which seems like an extra bad idea generally). I mean, like, if you have:

   LOOKAHEAD(2) "foo" "bar" "baz"

the alternative using up-to-here notation would be:

   "foo" "bar" =>|| "baz"

And it's not just that this achieves a much greater economy of expression, but I think, more importantly: this actually corresponds much better to how a human thinks about the problem.

I really think so. A human trying to read code thinks to himself: Well, I scan up to this point and then I can stop because I know that this must be a method declaration. A human does not think: "I need to scan exactly 7 tokens ahead". And besides that, that approach does not even work in the general case! I mean, there's no specific number of tokens you can scan ahead to identify a class declaration, say, and I think that's usually the case with the main constructs in any fairly complex language. So, in the Java grammar, you have:

ClassDeclaration :
  {permissibleModifiers = EnumSet.of(TokenType.PUBLIC, TokenType.PROTECTED, TokenType.PRIVATE, 
  TokenType.ABSTRACT, TokenType.FINAL, TokenType.STATIC, TokenType.STRICTFP, TokenType.SEALED, 
  TokenType.NON_SEALED);}#
  Modifiers
  "class" =>|| 
  TypeIdentifier
  [ TypeParameters ]
  [ ExtendsList ]
  [ ImplementsList ]
  [ PermitsList ]
  ClassOrInterfaceBody
;

We scan past whatever modifiers there are (if any) and then, once the next token is "class", well, we don't need to scan any more, this definitely must be a class declaration! That's how it's expressed and that's how a human thinks about the problem. (Well, this human anyway!)
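To make the "scan up to here and stop" idea concrete, here is a small Java sketch of what such a lookahead amounts to. It is an illustration only, not the code CongoCC generates: tokens are modelled as plain strings, the class and method names are made up, and the real Modifiers production also handles things like annotations, which this ignores.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of an up-to-here scan; not generated parser code.
final class UpToHereSketch {
    static final Set<String> MODIFIERS = Set.of(
            "public", "protected", "private", "abstract", "final",
            "static", "strictfp", "sealed", "non-sealed");

    // Skip however many modifiers there are, then check for "class".
    // No fixed token count is involved: the scan stops at the up-to-here point.
    static boolean looksLikeClassDeclaration(List<String> tokens, int pos) {
        int i = pos;
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            i++;
        }
        return i < tokens.size() && tokens.get(i).equals("class");
    }
}
```

For example, `looksLikeClassDeclaration(List.of("public", "final", "class", "Foo"), 0)` succeeds after three tokens, while a bare `class Foo` succeeds after one, which is exactly what a fixed LOOKAHEAD(n) cannot express.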

And, of course, even if a numerical lookahead can be used, it still tends to be more fragile, so if you have:

   LOOKAHEAD (4) Foobar() Baz()

because a Foobar production always consumes 3 tokens and you also need to check whether the first token of Baz is the right one (so it's a lookahead of 4 tokens), and then, due to whatever language evolution, you rewrite the Foobar production so that it potentially consumes 4 or 5 tokens, say, then all these LOOKAHEAD(n) things have to be rewritten to reflect that. But using up-to-here-plus notation, we can write:

    Foobar =>|+1 Baz

And that doesn't need to be changed. It's much more robust than the code that uses a numerical lookahead. And again, I think it also corresponds more closely to how a human being actually thinks of the problem.

So, in the Java grammar here you have:

Initializer# :
    [ "static" ] =>|+1 Block
;

This could admittedly be expressed with a syntactic lookahead, but it has to be there every time you use the Initializer production at a choice point. And it's really kind of ugly:

LOOKAHEAD (["static"] "{") Initializer()

But if you use a numerical lookahead, you would have to write:

LOOKAHEAD(2) Initializer()

But there it is actually looking ahead 2 tokens in many cases when it doesn't need to. It only needs to look ahead 2 tokens if the next token is "static". Otherwise, a single token lookahead is enough.

Actually, I suppose if one were going to look at what I tend to think of as best practices in terms of writing a grammar, the best reference could well be the C# grammar. That, by the way, was quite difficult to write and it really stresses the CongoCC feature set to the limit! I'm pretty sure that the legacy JavaCC is simply not powerful enough to write a decent grammar for that language.
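Going back to the Initializer example, the difference between the two lookahead styles can be sketched in Java as follows. This is a hand-written illustration under simplifying assumptions, not generated parser code: tokens are plain strings, a Block is assumed to always start with "{", and the names are invented.

```java
import java.util.List;

// Hypothetical illustration of [ "static" ] =>|+1 Block as a hand-written check.
final class InitializerLookaheadSketch {
    // True if an Initializer can start at pos: an optional "static", then "{".
    static boolean startsInitializer(List<String> tokens, int pos) {
        int i = pos;
        if (i < tokens.size() && tokens.get(i).equals("static")) {
            i++; // consume the optional "static"
        }
        return i < tokens.size() && tokens.get(i).equals("{");
    }

    // How many tokens the check above has to look at (ignoring end of input):
    // two only when the first token is "static", otherwise one.
    static int tokensInspected(List<String> tokens, int pos) {
        boolean sawStatic = pos < tokens.size() && tokens.get(pos).equals("static");
        return sawStatic ? 2 : 1;
    }
}
```

A blanket LOOKAHEAD(2) would look at two tokens every time, whereas this check only goes to the second token when the first one is "static".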

adMartem

Yep, that was the motivation for needing to replace the LOOKAHEAD=3, which is no longer in CongoCC. It was there in the first place because (years ago) I was in a situation where no matter how I used LOOKAHEAD, it couldn't seem to eliminate parsing problems with COBOL. Finally, I discovered changing LOOKAHEAD to 3 fixed all the problems, and didn't seem to add any new ones, so I left it there. But it always bugged me until I looked at the generated Java and realized that the LOOKAHEAD did not nest. Then it just became a permanent irritant. With CongoCC I am resetting to the LL(1) assumption and cleaning up the grammar to use hopefully minimal =>|| scanning instead.