Legacy Glitches in Lookahead

revusky

Hard-core JavaCC aficionados -- well, I mean those who started using the tool back in the early days and have possibly only transitioned to JavaCC 21 fairly recently -- are probably aware that LOOKAHEAD has always been broken in various ways. Simple usage tended to work well enough, but any attempts to do anything beyond a certain threshold of sophistication using LOOKAHEAD were liable to lead to one running full speed against a wall. I suppose that this was generally understood among "power users" but it is hard to tell, since legacy JavaCC users are, by and large, an amazingly long-suffering group of people. They never seem to have gotten very vocal about complaining about these various things -- perhaps because it was always well understood that there was little point; nothing was ever going to be done to address these problems anyway.

Actually, in terms of lookahead, I'm pretty sure that I fixed the single biggest problem with this a couple of years ago. See here. Nonetheless, there is one pretty major glitch remaining. I've been aware of this for some time and it has been kind of like a thorn stuck in my mind, but I am only now turning my attention to the problem. The problem is that the scanahead logic generates code that is not quite correct when it comes to scanning ahead in a choice construct.

Consider the following grammar:

TREE_BUILDING_ENABLED=false;
BASE_NAME="Test";

INJECT TestParser :
{
    public static void main(String args[]) {
        new TestParser(System.in).Root();
    }
}

SKIP : " " ;

NestedChoice :
   SCAN 2 "foo" "bar" "baz"  
   |
   "foo"
;

Root :
    SCAN NestedChoice => {System.out.println("lookahead succeeded");} NestedChoice
    |
    {System.out.println("lookahead failed");}
;

You can build the above little example like so:

java -jar javacc-full.jar <grammar filename>
javac TestParser.java

And then it can be tested on the command-line with:

java TestParser

Then you type a line of input to pass into the parser, for example:

java TestParser
foo bar baz

Now, the above example is designed to illustrate a certain point. Consider the NestedChoice production. It contains a binary choice: it will consume the three tokens "foo", "bar" and then "baz" (that is the first choice) OR it will consume a single "foo" token, the second choice.

However, there is a subtle point here. If the next two tokens off the stream are "foo" and then "bar", it does not check for the "baz". Thus, if the next token is not a "baz", it will not go to the next choice and consume a single "foo". This is because, once we have matched the predicate, which is the first two tokens, we have decided (correctly or not!) that we must go into the first choice. Another way of framing the issue is to realize that the next token, "baz", is not a choice point. At this point, having passed the first 2-token lookahead that was specified, we are committed to this branch and so, if the next token is not a "baz", we are going to hit an error.

So, here is the summary of what input the NestedChoice production can match:

The three tokens "foo", "bar", then "baz"
The single token "foo" as long as it is not followed by a "bar"

Examples of matching input:

"foo" "bar" "baz"
"foo" (something other than "bar")

Examples of non-matching input:

Anything that does not start with "foo"
"foo" "bar" followed by something other than "baz"

Note that if the next tokens off the stream are "foo", "bar", "foo", it does not matter that the second option, the lone "foo", would have worked. In the first option, we scan ahead two tokens, so if the first two tokens match, we are committed to that first option. That is how the tool's logic works. If we wanted to check the first three tokens before committing ourselves, not just the first two, we could have written SCAN 3 or the line could be written "foo" "bar" "baz" =>||. (As things stand, the line with SCAN 2 could be written equivalently as "foo" "bar" =>|| "baz")
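
To make the commitment rule concrete, here is a minimal hand-written Java sketch of how NestedChoice behaves when parsing. It is only a model under stated assumptions, not the code the tool generates: tokens are plain strings, and the class and method names (NestedChoiceParseSketch, parseNestedChoice) are made up for illustration.

```java
import java.util.List;

public class NestedChoiceParseSketch {

    // Hypothetical model of PARSING NestedChoice.
    // Returns the number of tokens consumed, or -1 to signal a parse error.
    static int parseNestedChoice(List<String> tokens) {
        // First option: SCAN 2 "foo" "bar" "baz". The predicate is the first two tokens.
        boolean passedPredicate = tokens.size() >= 2
                && tokens.get(0).equals("foo")
                && tokens.get(1).equals("bar");
        if (passedPredicate) {
            // We are committed now: "baz" is not a choice point,
            // so a mismatch here is an error, not a fallback to option two.
            boolean hasBaz = tokens.size() >= 3 && tokens.get(2).equals("baz");
            return hasBaz ? 3 : -1;
        }
        // Second option: a lone "foo" (only reached if the predicate above failed).
        if (!tokens.isEmpty() && tokens.get(0).equals("foo")) {
            return 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(parseNestedChoice(List.of("foo", "bar", "baz"))); // 3
        System.out.println(parseNestedChoice(List.of("foo", "baz")));        // 1
        System.out.println(parseNestedChoice(List.of("foo", "bar", "foo"))); // -1
        System.out.println(parseNestedChoice(List.of("bar")));               // -1
    }
}
```

The third call is the interesting one: the lone "foo" would have matched, but it is never tried because the two-token predicate already committed us to the first option.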

Now let us consider the other production in the grammar, Root. It is also a choice. The first line starts with:

 SCAN NestedChoice =>

meaning that we are going to scan ahead to see if a NestedChoice production can be matched. Here is where the plot thickens. If we run the test harness:

java TestParser

and we feed it the input:

foo bar foo

We get this output:

imac:~/projects/javacc/legacy_test>java TestParser
foo bar foo
lookahead succeeded
Exception in thread "main" ParseException:
Encountered an error at (or somewhere around) input:1:9
Was expecting one of the following:
BAZ
Found string "foo" of type FOO
    at TestParser.handleUnexpectedTokenType(TestParser.java:512)
    at TestParser.consumeToken(TestParser.java:504)
    at TestParser.NestedChoice(TestParser.java:203)
    at TestParser.Root(Test.javacc:20)
    at TestParser.Root(TestParser.java:228)

The problem is that, by all rights, the lookahead SCAN NestedChoice should have failed! But it succeeded because the choice logic when scanning ahead is not the same as when parsing. When you are parsing, the "foo" "bar" sequence must be followed by a "baz" for the production to successfully parse. This is because we are committed to the first choice in NestedChoice once the first two tokens match. If the next token is not what is expected, we do not go on to the next branch, which is matching a single "foo" token. We just hit an error condition.

But when we are scanning ahead, we do go to the next option and see if it matches. This would be the right thing to do if the predicate had failed. So if the input is "foo" "baz", then the predicate for the first choice did not match, so we go to the next choice, which is correct, because, since the predicate (matching the first two tokens) failed, we never got committed to the first branch.

Another way of looking at all of this is that there are actually two ways that a sub-expansion in a choice construct can fail.

The predicate (if none is specified, this amounts to just checking the first token) fails.
The predicate succeeds but then there is a failure elsewhere, at a non-choice-point.

So here are the two key cases:

If the predicate of the sub-expansion in a choice fails, then we move on to check the next sub-expansion in the choice construct.

If the scanahead of the sub-expansion in a choice fails BUT the predicate succeeded, then we should NOT go to the next choice. The scan of the choice construct as a whole is considered to have failed.

In other words:

If the nested lookahead succeeded in a choice construct, there should NOT be an attempt to match any subsequent sub-expansions in the choice construct.

This is not currently dealt with correctly. When we are scanning ahead (as opposed to parsing) we go to the next sub-choice regardless of whether the predicate succeeded or not! This is really a major glitch that dates back to legacy JavaCC. It has always been there. Of course, this sort of problem is far worse in legacy JavaCC because nested syntactic lookahead is simply ignored completely in legacy JavaCC. JavaCC 21 does address this, but we are left with this problem that the logic is still not quite right.
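
The difference between the two behaviors can be sketched in a few lines of Java. This is a hand-written model of scanning ahead over the NestedChoice production from the example, not the generated scan routine; the class name, the startsWith helper, and the legacyGlitchy parameter are all assumptions made for illustration.

```java
import java.util.List;

public class ChoiceScanSketch {

    // Hypothetical helper: do the tokens begin with the expected sequence?
    static boolean startsWith(List<String> tokens, String... expected) {
        if (tokens.size() < expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (!tokens.get(i).equals(expected[i])) return false;
        }
        return true;
    }

    // Model of SCANNING AHEAD over NestedChoice.
    // legacyGlitchy == true  : always fall through to the next option (old behavior).
    // legacyGlitchy == false : stop once a passed predicate is followed by a failure.
    static boolean scanNestedChoice(List<String> tokens, boolean legacyGlitchy) {
        boolean passedPredicate = startsWith(tokens, "foo", "bar");
        if (passedPredicate) {
            if (startsWith(tokens, "foo", "bar", "baz")) return true;
            // Failure at a non-choice point: the scan of the whole choice
            // should fail here, exactly as the parse would.
            if (!legacyGlitchy) return false;
        }
        // Second option: a lone "foo".
        return startsWith(tokens, "foo");
    }

    public static void main(String[] args) {
        List<String> input = List.of("foo", "bar", "foo");
        System.out.println("legacy scan:    " + scanNestedChoice(input, true));  // true
        System.out.println("corrected scan: " + scanNestedChoice(input, false)); // false
    }
}
```

For input where the predicate fails, such as foo baz, both variants agree and move on to the second option; they only diverge in the "predicate passed, then failed" case.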

Again, this is clearly a bug, because the principle of least surprise really requires that if we write:

 SCAN SomeProduction => SomeProduction

or in the legacy syntax:

 LOOKAHEAD(SomeProduction()) SomeProduction()

or more economically in the new syntax:

 SomeProduction =>||

we should expect the semantics of the scanahead to be the same as in the parsing. If the lookahead succeeded, we should expect the parsing to succeed. However, in the example I give here, the lookahead can succeed yet the parsing can still fail.

What to Do?

Well, this is a bug and the thing to do (quite uncontroversially, I would say) is to fix it. However, it must be admitted that the bug-fix is not absolutely backward compatible. It is possible that there are grammars that rely on this working the way it works currently -- regardless of it being wrong!

My current tendency is to fix the bug in JavaCC 21, but to put in a setting allowing one to turn on the old (screwy) way of working.

When the CongoCC transition is done, we won't take that setting with us. In general, this is a recurring theme. Certain things will be kept working in JavaCC 21 but won't be around in CongoCC. I do not anticipate any support for legacy JavaCC syntax in CongoCC. Even some things that were added in JavaCC 21 will not be brought over to Congo. (Hey, if you're gonna go on a trek in the jungle, you don't want to carry a bunch of unnecessary junk with you, eh?)

In JavaCC 21, you can write:

  SCAN Foo Bar Baz

or:

  => Foo Bar Baz

and that is the same as:

  Foo Bar Baz =>||

I think we're only taking the last of the three to Congo. But I anticipate having an automatic converter tool that converts older deprecated syntax to the newer syntax, so you can run that over a legacy grammar and things like:

void Foobar() : 
{}
{
    "foo" "bar"
}

will just get rewritten as:

 Foobar : "foo" "bar" ;

And so on. And the streamlined syntax is all that will be supported.

Well, in closing, looking back, I realize that I haven't done much with this for at least a half year, but I was sort of absorbed with other things. I'm writing this on this discussion forum (as opposed to the similarly quiet blog) because I am painfully aware that the thing has been quiet for a long time and I would like to see if I can remedy that state of affairs. So, certainly, for anybody who is lurking, please feel free to say something. Don't worry about saying anything dumb. I say dumb things often enough myself!

adMartem

Ok, I'll jump in. As I recall, in solving at least one of my earlier problems using JavaCC 21, I ran across something like (or maybe the same as) the behavior you describe. At the time, I had made the assumption that the "choice sequence" taken in scanning was always identical to that to be taken when parsing. I discovered empirically that it wasn't, and adjusted some productions to account for it. But my first thought was that it felt like it should work the way I expected, as I see now you too feel. When musing at that time about what that might entail, I concluded that some sort of memoization of the choices taken in scanning could perhaps be used to replay the choices when parsing, for as long as they lasted, thus accomplishing two goals. First, it would do what a user would naturally expect, and second, it could save much of the effort of re-checking all the predicates to get there (depending on implementation). But that then left the issue of semantic lookahead (what is the proper term for that in JavaCC 21 -> Congo?). If it is active in scanning, can its answer change in parsing? If active in scanning, does it truncate replay of choices? If it did, wouldn't it maybe be more natural to assume that the {...} meant "evaluate this predicate only when scanning" (since if this is true, it can't affect parsing) and {...}# meant "evaluate at both scanning and parsing"? At the time, my answers would have been "yes", but as I write this I realize the question of side-effects was not in my mind at the time (since almost all of my use of this is strictly to steer parsing).

revusky

adMartem At the time, I had made the assumption that the "choice sequence" taken in scanning was always identical to that to be taken when parsing. I discovered empirically that it wasn't, and adjusted some productions to account for it.

Yes, but the problem does predate JavaCC 21. I think that, generally, there should be a sort of "contract" that when you have something like:

  SCAN some_expansion => some_expansion

(which, of course, can be written more tersely as: some_expansion =>||)

that if the expansion succeeds in scanning, it also succeeds in parsing. Well, at least, assuming that the expansion is entirely syntactic. Once you have Java code in there, then, well, conceivably something happens when scanning but not when parsing, or vice versa. Well, for example, consider the definition of AssignmentExpression in the Java grammar.

AssignmentExpression :
  {
    Expression lhs;
  }
  TernaryExpression {lhs = (Expression) peekNode();}
  [
     SCAN 1 {lhs.isAssignableTo()}
    => AssignmentOperator Expression
  ]
;

If you're scanning ahead an AssignmentExpression, the lhs variable is not even in scope and, of course, the code snippet {lhs = (Expression) peekNode();} is not present in the scanning routine, and similarly, the lhs.isAssignableTo() condition is only checked when parsing, not scanning. You see, the problem is that not all expressions are assignable, so if you just had a grammar that said:

   AssignmentExpression : TernaryExpression [AssignmentOperator Expression] ;

you would be accepting nonsense like:

  f(x) = 7;

or:

 x + 3 = y;

But the check that the left-hand-side is assignable can only work if you are parsing because it relies on the tree-building, and that is only active when you're parsing, not when you're scanning. So, the upshot is that if we are scanning ahead, we can accept something like f(x) = 7; but then, to have a really correct parser, it fails when it actually tries to parse the construct, because then it hits that semantic check.

A similar thing happens with modifiers. In a scanahead, you might as well scan past all the modifiers until you reach a "class" token, let's say, so suppose you have input like:

public private class Foo {...}

or say:

abstract final void foo() {...}

Well, I think the most practical approach is just scan past the modifiers and not worry about whether it really makes sense but then when you actually parse the input, you then do a sanity check that would be expressed in Java (or whatever the target language) code.

Well, my point is just that it is quite conceivable that a scanahead succeeds but the parsing then fails, and that can be quite deliberate, but that would be because you have some sanity check that is only carried out in the parsing stage.

As far as I can see, if the expansion is purely syntactic (i.e. no code actions or conditions expressed in code) then a scan succeeding and the parse succeeding should be effectively equivalent. So, legacy JavaCC simply ignoring nested syntactic lookahead was always a blatant violation of that contract!

Now, as for the rest of your comment, the reference to memoization, I've thought along those lines. There is a general problem that there can be a lot of superfluous scanahead. For example, if you have:

  Foo Bar =>|| Baz Bat

then you scan ahead to see if you can scan a Foo followed by a Bar and then when you deal with the expansion as a whole, you scan Foo Bar Baz Bat as a whole and, yes, you end up scanning Foo Bar twice!

Actually, the bloody minded way it's implemented now, the case of:

Foo Bar Baz Bat =>||

it just ends up scanning twice in the nested lookahead. It scans the predicate and then scans the expansion -- which is the same thing!

(Of course, if this is a non-nested lookahead, then, yes, it scans the expansion and then parses it, but if you're in a nested lookahead, it just does the same thing twice! And I also know that this can get very expensive in a heavily nested construct, if you're just scanning the whole thing to the end.)

There are these various issues that have been in the back of my mind for a while. So, I thought about memo-ization a bit, but actually, I think that gets very complicated to implement and there are more practical solutions to this problem that would be simpler. And the fact is that a bit of superfluous scanning is not usually such a big deal practically speaking. I'm pretty sure that, as a practical matter, it's only a real problem when you scan into deeply nested recursive constructs. I think, by the way, this must be why some ANTLR grammars are so ridiculously slow. But I've never really looked into that. But I have heard from various people that they tried to move their crufty old JavaCC generated parser to ANTLR and just couldn't because the performance hit was just too great. Of course, the other possibility to consider is that JavaCC-generated parsers may have been running much faster because they just ignore the whole nested lookahead issue, so, looking at it from that POV...

I have some other thoughts about all this, but I'll close this message here...

adMartem

revusky Yes, but the problem does predate JavaCC 21.

I didn't mean to imply that it didn't. Yes, I had numerous issues with this in the legacy JavaCC. With nested syntactic lookahead working, I had to reorganize some productions, in particular the high-level ones that used lookahead, but now had lower levels actually doing something. With legacy JavaCC, it was so broken in this area that I didn't notice that lookahead (when it worked) had this characteristic. With JavaCC 21 I actually got to the point of noticing this non-intuitive behavior.

But to the point, I agree completely that for pure syntactic SCANing a successful scan should mean a successful parse. For semantic lookahead I think if it is enabled for both and is idempotent it should behave similarly.

There is a general problem that there can be a lot of superfluous scanahead ...

Indeed. The COBOL parser has a lot of this. The language is so devoid of symmetry that some high-level choices can end up trying to parse the entire program several times to determine what is going on. Even seemingly "simple" statements can suffer from this to an extent. When I was doing a quick-and-dirty performance enhancement to get the JavaCC 21 times in the same binary magnitude as JavaCC, pretty much everything I did was based on profiling to find where I was rescanning the same non-terminals an inordinate number of times and then revising the grammar to "tighten up" the scanning for those cases (now that nested lookahead works, many of the high-level lookaheads could be shortened). Doing so took the parse time from 10-20x down to 1-5x with an average of 2.5. I'm still planning to do more in this area, but what I've done reduced the anxiety that I was going to hit a wall at the end. Of course, regarding lexing, what I saw was that the lexer time was ignorable when the parse time was 10-20x, but when it is 2 or 3x, the lexer percentage is not so insignificant. But, right now, that is at the back of my mind.

I am to the point, I think, that my next step will be to tackle the problem of JTB vs JJTree AST nodes. Since I have a LOT of visitor code, and it relies on JTB nodes, my plan is to generate JTB-compatible nodes somehow, make sure everything still works COBOL -> Java, and then see about getting the grammar as efficient as necessary within the complete context. I'll post a different discussion topic on my current view of the JTB issue.

revusky

adMartem Yes, but the problem does predate JavaCC 21.

I didn't mean to imply that it didn't.

I was pretty sure that you knew that it's a longstanding problem, but I guess I just said that for the benefit of anybody else who might be lurking!

As regards the whole question of redundant scanahead, that is kind of an interesting issue. I mean, it's true, for example, that when you step through any parser generated from a fairly complex grammar, it's kinda shocking how much repetitive scanahead there is. But, then, in most cases it doesn't seem to be much of a problem in terms of performance. As a practical question, parsers generated by JavaCC 21 still tend to be pretty fast. And, in fact, scanning ahead a handful of tokens is probably so fast that I suspect that memo-ization might not even produce any performance improvement. (Though there are cases where it would, when you have to repeatedly scan past some very complex nested construct.. but I mean typically.)

Well, for example, if you consider the TypeDeclaration production in the Java grammar which looks like this:

TypeDeclaration #interface :
  SCAN TypeDeclarationLA =>
  (
    EmptyDeclaration
    |
    AnnotationTypeDeclaration
    |
    ClassDeclaration
    |
    InterfaceDeclaration
    |
    EnumDeclaration
    |
    RecordDeclaration
  )
;

Except for the EmptyDeclaration (which is just a lone semicolon) all of the other sub-productions, like ClassDeclaration or InterfaceDeclaration start with zero or more "modifiers", public, private, final etc. or annotations.

So, effectively, the predicate for each of those sub-expansions involves scanning past the modifiers and then peeking to see what the next token is. For example, with ClassDeclaration you have:

ClassDeclaration :
  {permissibleModifiers = EnumSet.of(TokenType.PUBLIC, TokenType.PROTECTED, TokenType.PRIVATE, 
   TokenType.ABSTRACT, TokenType.FINAL, TokenType.STATIC, TokenType.STRICTFP, TokenType.SEALED, 
   TokenType.NON_SEALED);}#
  Modifiers 
  "class" =>||
  TypeIdentifier /name/
  [ TypeParameters ]
  [ ExtendsList ]
  [ ImplementsList ]
  [ PermitsList ]
  ClassOrInterfaceBody
;

So, basically, in terms of deciding whether to enter that production, it scans past the various modifiers, like public, private, etc. and whatever annotations there might be until it finds (or not) the "class" token. (And that actually matches how a human would typically eyeball the code!) But my point is that, when parsing a TypeDeclaration, it will typically scan past Modifiers multiple times. It scans past the Modifiers and then checks for the "class" token, and then if it's not there, it just goes back and scans past the very same modifiers again, and if the next token is not "interface", it goes to the next choice. And so on. This is a very natural way to write the TypeDeclaration, I think, but yes, it does involve redundantly scanning past the Modifiers a number of times. However, I'm pretty sure that, as a practical matter, it doesn't matter very much. It doesn't have to re-tokenize, so the redundant scanning is really pretty cheap. It can, however, get expensive when the leading stuff that you have to scan past each time is something very complex with nested constructs that themselves involve nested scanaheads.
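
As a rough illustration of that redundancy, here is a small Java sketch that counts how many tokens get inspected while picking a declaration kind. It is a simplified model, not the generated parser: tokens are plain strings, annotations and the EmptyDeclaration and AnnotationTypeDeclaration alternatives are left out, and the names ModifierRescanSketch, predicate, chooseDeclaration and tokensInspected are invented for the example.

```java
import java.util.List;
import java.util.Set;

public class ModifierRescanSketch {

    // Simplified modifier set; the real grammar also allows annotations and more keywords.
    static final Set<String> MODIFIERS =
            Set.of("public", "protected", "private", "abstract", "final", "static");

    // Counts every token looked at while evaluating predicates.
    static int tokensInspected = 0;

    // Model of one alternative's predicate: scan past the modifiers,
    // then check whether the next token is the expected keyword.
    static boolean predicate(List<String> tokens, String keyword) {
        int i = 0;
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            tokensInspected++;
            i++;
        }
        tokensInspected++; // the keyword check itself
        return i < tokens.size() && tokens.get(i).equals(keyword);
    }

    // Tries the alternatives in order; each one rescans the same modifiers.
    static String chooseDeclaration(List<String> tokens) {
        tokensInspected = 0;
        for (String keyword : List.of("class", "interface", "enum", "record")) {
            if (predicate(tokens, keyword)) return keyword;
        }
        return null;
    }

    public static void main(String[] args) {
        String kind = chooseDeclaration(List.of("public", "abstract", "interface", "Foo"));
        // The two modifiers are scanned once for "class" and again for "interface".
        System.out.println(kind + " chosen after inspecting " + tokensInspected + " tokens");
    }
}
```

In this toy model the cost grows with (number of alternatives tried) times (number of leading modifiers), which stays small for realistic declarations. That is consistent with the point above that the redundancy is usually cheap unless the leading material is itself a complex nested construct.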

But I realized something eyeballing the code that is kind of interesting. You know how you used to have to write:

   LOOKAHEAD (Foo() Bar()) Foo() Bar()

and now you can write:

   Foo Bar =>||

I just realized that in the current implementation, the second way of writing it is not only shorter and clearer, but actually generates more efficient code! (I only just realized this!)

Bear with me and I'll show you why.

Consider the relevant part of the template that generates the code for this. https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl#L162-L179

Essentially, it's like this:

[#macro BuildScanRoutine expansion]
  private final boolean ${expansion.scanRoutineName}() {
    try {
       lookaheadRoutineNesting++;
      ${BuildPredicateCode(expansion)}
      ${BuildScanCode(expansion)}
       return true;
    }
    finally {
       lookaheadRoutineNesting--;
    }
  }
[/#macro]

The key line is line 170.

So the scan routine checks the predicate and then checks the expansion. Now, the BuildPredicateCode macro looks like:

[#macro BuildPredicateCode expansion]
   [#if expansion.hasSemanticLookahead && (expansion.lookahead.semanticLookaheadNested || 
    expansion.containingProduction.onlyForLookahead)]
     if (!(${expansion.semanticLookahead})) return false;
   [/#if]
   [#if expansion.hasLookBehind]
     if ([#if !expansion.lookBehind.negated]![/#if]
     ${expansion.lookBehind.routineName}()) return false;
   [/#if]
   [#if expansion.hasSeparateSyntacticLookahead]
     if (remainingLookahead <= 0) return !hitFailure;
     if ([#if !expansion.lookahead.negated]![/#if]
     ${expansion.lookaheadExpansion.scanRoutineName}()) return false;
   [/#if]
[/#macro]

So it's really not so hard to understand. The line I would call your attention to is:

       [#if expansion.hasSeparateSyntacticLookahead]

If the expansion has separate syntactic lookahead, then we scan through that lookahead expansion. So, if you write:

   SCAN Foo Bar => Foo Bar

(LOOKAHEAD(Foo() Bar()) Foo() Bar() in legacy syntax)

that condition expansion.hasSeparateSyntacticLookahead returns true! Why? Because there is no check for whether the expansion lookahead is the same as the expansion. But if you write it the new approved way:

  Foo Bar =>||

then expansion.hasSeparateSyntacticLookahead returns false and thus, the extra (superfluous) scan is not generated. Of course, this should be addressed. This is also the case for something like:

 Foo =>|| Bar

In the above, expansion.hasSeparateSyntacticLookahead returns false. So no predicate code is generated: the BuildPredicateCode macro actually generates nothing, and we just go to line 171 in the macro.

This also leads to an interesting realization, albeit a year and a half late. I had some dialogue with a guy from the JSQLParser project. That project was using (and still does) the legacy JavaCC. I helped him get the project's SQL grammar working with JavaCC 21 but he had horrendous performance problems. I knew that the problem was that the grammar (quite horrendous) would frequently do stuff like:

  LOOKAHEAD(SelectStatement()) SelectStatement()

and it was dog slow on some various sample input he had, because the SQL SELECT statement can be very deeply nested and it was doing a full lookahead on each level of nested recursion and.... So it could be very very much slower than the legacy JavaCC generated parser, which, as we know, just ignores all nested lookahead anyway!

I just realized now, a year and a half after all that, that most likely, the whole thing could have been resolved by replacing the above code with:

  SelectStatement =>||
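
To see why that one change could matter so much, here is a toy cost model in Java. It is an assumption-laden sketch, not a measurement and not the generated code: it pretends the statement is nested depth levels deep, that every level is guarded by a lookahead on itself, and that a separate syntactic lookahead makes the scan routine scan the inner construct twice (once as predicate, once as expansion). The names NestedScanCostSketch and scanCalls are made up.

```java
public class NestedScanCostSketch {

    // Toy model: number of scan-routine invocations needed to scan ahead
    // over a construct nested `depth` levels deep.
    static long scanCalls(int depth, boolean separateLookahead) {
        if (depth == 0) return 1; // innermost level: a single scan
        long inner = scanCalls(depth - 1, separateLookahead);
        // With a separate syntactic lookahead (the LOOKAHEAD(X()) X() style),
        // the nested scan checks the predicate and then the identical expansion,
        // so the inner construct is scanned twice. With X =>|| it is scanned once.
        return 1 + (separateLookahead ? 2 * inner : inner);
    }

    public static void main(String[] args) {
        for (int depth : new int[] {1, 5, 10, 20}) {
            System.out.println("depth " + depth
                    + ": separate lookahead = " + scanCalls(depth, true)
                    + ", up-to-here = " + scanCalls(depth, false));
        }
    }
}
```

Under these assumptions the separate-lookahead variant grows as 2^(depth+1) - 1 while the up-to-here variant grows as depth + 1. Real grammars will not match this model exactly, but it suggests why deeply nested SELECT statements were hit so hard.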

Of course, the real solution is for the thing to have more built-in smarts regarding whether the syntactic lookahead really is separate. I mean, it really should generate the same code for:

SCAN "foo" "bar" => "foo" "bar" Baz

as it does for:

 "foo" "bar" =>|| Baz

It will need to recognize that the lookahead expansion "foo" "bar" is just the first part of the actual expansion to be parsed, i.e. "foo" "bar" Baz.

That's not too hard, I don't think...

But it is kind of interesting (and I just realized it!) that the second, shorter expansion above actually generates more efficient code than the other, more verbose way. And that has been the case for a couple of years, I guess, ever since I fixed the nested syntactic lookahead issue.

adMartem

revusky This also leads to an interesting realization, albeit a year and a half late. I had some dialogue with a guy from the JSQLParser project.

It's a small world. I have a meeting in 10 min. on the subject of supporting SQLJ (close, but no cigar) in our COBOL!

revusky

I'm going to write a blog post about this, but I'll just go over the key points right here.

This is now fixed. However, the fix is not active at the moment by default. I introduced a new setting called LEGACY_GLITCHY_LOOKAHEAD which is on by default. So, the example I gave above works correctly now, BUT only if you put LEGACY_GLITCHY_LOOKAHEAD=false up top. The reason for that is that I came to the conclusion that a high percentage of fairly complex grammars may well be broken as a result of fixing this issue!

This project has 4 fairly complex grammars: Java, Python, CSharp, and the JavaCC grammar itself. (The big four. Something like JSON is not worth counting.) Of those, all but the Java grammar were broken by this fix! Some fix, some might say... But this is the correct behavior and it will definitely be the default in CongoCC.

Now, actually, I just narrowed down the issue with the Python grammar to one point, here.

Assignment :
 SCAN (SimpleAssignTarget (":" | AugAssign)) | (StarTargets "=") =>
 (
  SimpleAssignTarget ":" =>|| Expression ["=" AnnotatedRhs]
  |
  SimpleAssignTarget AugAssign =>|| (YieldExpression | StarExpressions)
  |
  (=>StarTargets "=")+ =>|| (YieldExpression | StarExpressions)
 )
;

The lookahead for assignment, which is the first line (I'll write it in a more readable way here) is:

SCAN SimpleAssignTarget (":" | AugAssign) 
            | 
           StarTargets "="
             =>

You can see how it relies on the somewhat screwy lookahead. Consider input like:

 x = 7

The x matches SimpleAssignTarget but then the next token is = and the first choice fails. (AugAssign is all the assignment operators that are not a simple =, like += or -= etcetera.) Now, based on normal logic that is used in parsing, the lookahead should fail because we passed the (implicit) predicate of a single token. And that means that we don't go to the next choice. In reality, the expansion should be written:

SCAN   SimpleAssignTarget (":" | AugAssign) =>||
             |
            StarTargets "="
            =>

Except (at the moment) that does not work either, because I disallowed putting an up-to-here in a lookahead expansion. (That was a mistake actually. It should just be disallowed at the top level, but not within a choice... but that's another question.)

Still, it can be rewritten as:

   AssignmentLA#scan : 
       SimpleAssignTarget (":" | AugAssign) =>||
       |
       StarTargets "="
   ;

And then the lookahead in Assignment would be:

     Assignment : 
           SCAN AssignmentLA =>
           (
               ...
           )
     ;

Anyway, if you rewrite it as in the above, the Python grammar works with LEGACY_GLITCHY_LOOKAHEAD=false since that was the only point in the whole grammar that relied on that behavior.

Perhaps needless to say, LEGACY_GLITCHY_LOOKAHEAD will be off by default in Congo. However, I may actually keep the setting around, since it is so cheap really. The key points in the code generation where this is implemented are: here, here, and here.

Basically, we jump out of a choice construct if an option fails but the predicate passed. But here, we have the extra condition that legacyGlitchyLookahead is false. If that flag is true, then we don't jump out and we have the legacy behavior.

This occurs in the generation of a choice construct, i.e. X | Y | Z sort of thing. It also occurs in a ZeroOrOne sort of construct, like [expansion], but that can be understood by realizing that, really,

[X]

is the same thing as:

         X | {}

So, if we scan into X and it fails, but we passed the predicate, we don't go to the next choice, which is doing nothing.

So, for example, if we have:

       ["foo" "bar" =>|| "baz"]

if the input is foo bar foo then the above construct fails in a lookahead because once we passed the foo bar we were committed to a baz. And that logic should be the same whether we are parsing or scanning ahead.
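
A minimal Java sketch of that zero-or-one case follows. It is a hand-written model, not the template output: the method name scanOptional and the class name are invented, and the boolean parameter merely stands in for the LEGACY_GLITCHY_LOOKAHEAD setting.

```java
import java.util.List;

public class ZeroOrOneScanSketch {

    // Model of scanning ahead over ["foo" "bar" =>|| "baz"], read as X | {}.
    static boolean scanOptional(List<String> tokens, boolean legacyGlitchyLookahead) {
        boolean passedPredicate = tokens.size() >= 2
                && tokens.get(0).equals("foo")
                && tokens.get(1).equals("bar");
        if (!passedPredicate) {
            // Predicate failed: take the empty alternative, which always succeeds.
            return true;
        }
        if (tokens.size() >= 3 && tokens.get(2).equals("baz")) {
            return true;
        }
        // Predicate passed but "baz" is missing. The corrected logic fails here;
        // the legacy logic falls through to the empty alternative and succeeds.
        return legacyGlitchyLookahead;
    }

    public static void main(String[] args) {
        List<String> input = List.of("foo", "bar", "foo");
        System.out.println(scanOptional(input, false)); // false: committed to "baz"
        System.out.println(scanOptional(input, true));  // true: legacy fall-through
        System.out.println(scanOptional(List.of("qux"), false)); // true: empty alternative
    }
}
```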

This is only fixed for Java code generation, BTW. I'm hoping that somebody (most likely Vinay) will adjust the Python and CSharp templates accordingly. Mostly just a question of looking at where the variable passedPredicate is used.

adMartem

The COBOL grammar relied on only one glitch of the form [foo] bar | .... Works now with all the regression tests in non-glitch mode.

There are still some to find in the Report Writer and OO parts of the grammar, but they are only scanned and pitched, and they are not tested in regression (probably should be now that I think about it, since they should get through parsing without error).