A problem (revealed by) ASSERT in lookahead

adMartem

revusky
Yep, that fixed it. Thanks. Now, however, I get a host of new warnings of the form "The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention." I understand what it is trying to tell me, but in the case of the grammar the non-terminal at the choice point is like this one:

MnemonicNameReference :
    SCAN {isMnemonicName(getNextToken())}# => CobolWord | FAIL
;

Is it not true that the FAIL alternative should suppress the aforementioned warning (since it cannot choose the non-terminal unless there is a CobolWord present? And just now, looking at this, it would seem the FAIL is unnecessary and the semantic predicate should also prevent this from ever matching the empty string, right?

adMartem

adMartem
Actually, I tried it without the FAIL and it didn't give the warning. In one case it was an error message (couldn't match the following expansions) and that too went away. So it seems the FAIL is what is causing the problem. I would guess the message is because with the FAIL alternative causes the non-terminal to always be chosen at a decision point, but it may not consume any tokens, not realizing that the FAIL will never be chosen because of implicit lookahead failure in that case.

revusky

adMartem
WEll, a FAIL doesn't consume any tokens, but this is a spurious warning, I think. I guess I subtly rewrote certain things and now it's giving this spurious warning in these spots. Well, I have to look at this.

Well, just bear with me. We'll just gradually squash all these little bugs, not to worry.

revusky

adMartem
Well, the problem was that it was warning about issues like:

(Foo)*

when Foo can match empty input so you're going to get into an infinite loop. But in a case like:

Foo : "bar" | FAIL;

that is a spurious warning. If you had (Foo)* and Foo was:

  Foo : "bar" | ["baz"] ;

so it potentially matches empty input, then having it inside a repeating (....)* is problematic, because Foo always succeeds and you get an infinite loop...

Well, anyway, try it again. I think it's okay now!

adMartem

revusky
I think I still get the warnings and one error on the following:

...
AdvancingPhrase :
  ( <BEFORE> | <AFTER> ) [ <ADVANCING> ]
  ( <PAGE>
  | MnemonicNameReference =>||
  | ( Identifier =>|| | IntegerConstant | NumericFigurativeConstant) [( <LINE> | <LINES> )]
  | <TO> [ <LINE> ] (Identifier =>|| | IntegerConstant) [ [<ON>] <NEXT> <PAGE>] //TODO: implement this
  )
;
...

The (error) message is Error: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7974:5:This expansion can match the empty string.The following 2 expansions can never be matched.
The error goes away if I remove the FAIL, but the rest of the warnings on other expansions in different productions remain.
I double-checked that I was building from the latest pull from Javacc21 [5d66d704].

adMartem

Here's another one (more exciting) than previous one.
I have the following snippet in the grammar:

...
CombinableCondition :
    SimpleCondition =>|| | AbbreviatedRelationCondition =>|| | <LPARENCHAR> ASSERT ~(AbbreviatedCondition <RPARENCHAR> ArithmeticOperator) AbbreviatedCondition <RPARENCHAR> =>||
;
AbbreviatedRelationCondition :
    (   
            RelationalOperator ArithmeticExpression
        |   [ <NOT> ] RelationalOperator ArithmeticExpression
        |   [ <NOT> ] ArithmeticExpression =>||
//      |   ZERO/ZEROS/ZEROES shadowed by ArithmeticExpression()
        |   SignCondition =>||
    )
;

RelationalOperator :
(
    [ <IS> ] [ <NOT> ]  
        ( 
            SCAN 3 =>   <GREATER> [ <THAN> ] <OR> <EQUAL> [ <TO> ]
        |               <MORETHANOREQUAL>
        |   SCAN 3 =>   <LESS> [ <THAN> ] <OR> <EQUAL> [ <TO> ]
        |               <LESSTHANOREQUAL>
        |               <GREATER> [ <THAN> ]
        |               <MORETHANCHAR>
        |               <LESS> [ <THAN> ]
        |               <LESSTHANCHAR>
        |               (<EQUAL>|<EQUALS>) [ <TO> ]
        |               <EQUALCHAR>[ <TO> ]
        |               <NOTEQUALCHAR>
        |   SCAN {allowJas()}# => <JAS_NE>
        |   SCAN {allowJas()}# => <JAS_EQ>
   )
) =>||
;
...

The input string looks like this : NOT 10 AND 9 AND = 10 ... at the point that CombinableCondition is entered.
What happens is that the AbbreviatedRelationCondition up-to-here scan works fine and passes over the first two choices and succeeds on the third (correct) choice. Then when the AbbreviatedRelationCondition is entered it correctly skips the first choice but (incorrectly) selects the second one based on the first set rather than scanning.

I will try and reduce this to a test case if you need it, but I thought I would let you know right away with this fragment in hopes it is sufficient.

adMartem

adMartem
The funky first two choices are due to the (legal) syntax of "NOT NOT EQUAL 10" in the context of this production. Ugh!

adMartem

adMartem I'm beginning to think this is my (brain's) problem, perhaps masked in earlier CongoCC versions. Is it reasonable to assume that the lookahead will succeed at the same point as the selected non-terminal, or is it the case that I should have resolved the problem with an up-to-here in the 2nd choice of AbbreviatedRelationCondition:
... | [ <NOT> ] RelationalOperator =>|| ArithmeticExpression? I.e., my up-to-here scan was at too high a level.
... ( a little later) ...
Now I'm sure I was wrong-headed when I assumed the behavior I originally described. Short of memoization of the scan to make expansion choices always consistent with lookahead I don't see how it could be implemented the way I assumed. So now the mystery is how it ever worked that way (which it did). I'll have to go back and see what was generated before.

adMartem

adMartem
This is typical of the remaining warnings:

...
WriteCatena :
  RecordName [ <FROM> (Identifier =>|| | Literal) ]
  [ AdvancingPhrase ]
  [ [ At =>|| ] ( <END_OF_PAGE> | <EOP> ) =>|| StatementList ]
  [ <NOT> [ At =>|| ] ( <END_OF_PAGE> | <EOP> ) =>|| StatementList ]
  [ <_INVALID> [ <KEY> ] StatementList ]
  [ <NOT> <_INVALID> [ <KEY> ] StatementList ]
  [ <END_WRITE> ]
;
...
At :
    SCAN {isContextSensitiveWord("at")}# => CobolWord | FAIL
;
...

Warning: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7964:5:The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention. Warning: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7965:11:The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention. occurred at the "At" non-terminal.
When I remove the FAIL the error and warnings all go away.
I guess I probably don't need the up-to-here on the At reference since without the FAIL the SCAN will still be allowed. When I did these I was under the impression that I had to create a choice point in order to add the predicate.

revusky

adMartem

Well, I think there is still a bug in the logic for that warning. I have to look at this more closely. When the final choice in a choice construct is FAIL, then...

Well, not to worry... we'll get this stuff right. In any case, that it's only a warning means that you can disregard it. But the logic of this needs to be fine-tuned.

I do have to say that it is great to have somebody really using all these things in praxis. (Besides the project internally, that is...) Because that really is about the only way to get all this stuff working right.

Well, one aspect of this (that you surely realize) is that the language for expressing the grammar (meta-language to be pretentious...) in Congo/JavaCC21 is really vastly more powerful and expressive than what there is in the original JavaCC. So it is much harder to get everything absolutely right and probably, as a practical matter, the only way to do it is to have noisy, demanding end-users. (Like you.)

revusky

adMartem

I think this is fixed. It was a subtle bug in the sanity check. There is this general problem that the sanity check stuff is meant to catch buggy code, but if the sanity check itself is buggy... I guess that's also a paradox of unit tests and all that. Sure, it's a good idea, you can catch regressions and so on, but if the test itself is buggy....

Though it's maybe a tangent... I myself don't believe in unit tests that much, because I tend to find that if a system is sufficiently complex, the bugs tend to manifest themselves in the conjunction of more than one feature. So unit testing each feature individually can give one a false sense of confidence. And, in any case, I would put more stock in full functional tests than unit tests. We have at least 4 pretty major functional tests of the system, which are the Java, Python, CSharp grammars, and the rebuild/retest of the tool itself, which is written in itself!

Of course, you're hitting these bugs because you are using combinations of things that are not used in the aforementioned functional tests.

revusky

Well, I think the problem you're running into (or maybe it's just one of them) is that I changed (thinking I could get away with it) the way it works as regards using any scanahead specified in a non-terminal.

The way it was before, if you wrote:

   A B C
   |
   D E F

and let's say that B contains an up-to-here, that would be used as long as the preceding expansions were potentially empty, i.e. consumed no tokens. Potentially. So, A could be:

 A: ["foo" | "bar" | "baz"];

which is* potentially* empty. or if the first expansion in the choice above was:

   [A] B C

which amounts to the same thing...

The way it's implemented now, the elements before the nonterminal (say B in this case) must consume no tokens. Since [A] is potentially non-empty, then any up-to-here in B is ignored. But, in principle, you can still have:

     ASSERT {condition1()} {doSomething()} B C 
     |
      ....

And it would use the up-to-here in B, because the elements preceding B do not consume any tokens. (Granted, the code block that is second in the sequence could explicitly call consumeToken() but that's getting entirely too tricky. We do just assume that a Java code block does not consume any input.)

But anyway, the way it was expressed before was that the things preceding it potentially consumed no input. And I surely was thinking about this at some point. I was probably thinking in terms of constructs like:

    Modifiers TypeDefinition

where Modifiers (public, private, static etc.) is potentially empty so maybe the up-to-here is in the TypeDefinition. So there may well be a use-case for this (though none of my internal use was using this).

But finally (very recently) I decided that this was possibly a bit too tricky (not so much to implement as to just document!) and figured that I could get away with changing this so that the nonterminal has to the be the first non-empty sub-expansion in the sequence. I knew this was changing behavior but considered it unlikely that it would affect anybody and also I figured that I could get away with doing this now.

And if you really want to get dirty with the details a bit, this is where this is implemented: https://github.com/javacc21/javacc21/blob/master/src/java/com/javacc/core/NonTerminal.java#L67

So, the current "spec" is that the up-to-here (or SCAN) in a NonTerminal is used if:

 1. The NonTerminal in question is the first non-empty sub-expansion in the sequence
 2. There is no up-to-here (or SCAN) in the enclosing sequence that would have priority.
 3. We're not more than 1 nesting level deep in terms of calling non-terminals or sub-expansions

It could be worth noting that points 1 and 2 are determined at build-time, while point 3 is at run-time, when the parser is actually being run. (Worth noting if you want to develop a conceptual model of how the thing actually works...)

Anyway, the question now is basically:

Could you live with the above semantics?

adMartem

revusky
I think I see what you are saying. Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal if and only if:

The NonTerminal in question is the first non-empty sub-expansion in the enclosing sequence
There is no up-to-here (or SCAN) in the enclosing sequence that would have priority.

Additionally, when processing the grammar with actual input:

The parser is no more than 1 nesting level deep in terms of accepting non-terminals or sub-expansions.

Is that correct? And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance. I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

I can live with that.

The metaphysical problem I have with up-to-here is coming up with a way to think about it while writing productions. But that's my problem, I guess.

Finally, I would assume from this that the correct way to refactor the snippet I gave would be:

CombinableCondition :
    SimpleCondition =>|| | AbbreviatedRelationCondition | <LPARENCHAR>  AbbreviatedCondition <RPARENCHAR> ASSERT ~(ArithmeticOperator) =>||
;
...
AbbreviatedRelationCondition :
    (   
            RelationalOperator ArithmeticExpression
        |   [ <NOT> ] RelationalOperator =>|| ArithmeticExpression
        |   [ <NOT> ] ArithmeticExpression =>||
//      |   ZERO/ZEROS/ZEROES shadowed by ArithmeticExpression()
        |   SignCondition =>||
    )
;
...

i.e., no up-to-here in CombinableCondition (unnecessary), up-to-here on 2nd choice in AbbreviatedRelationCondition (necessary even though RelationalOperator has up-to-here). 1st choice RelationalOperator needs no up-to-here, as it is hoisted to this sequence.

Also, am I correct in assuming that lookahead is independent of acceptance in that the sequence that is checked in lookahead is not guaranteed to be the sequence that is accepted after the choice is taken?

adMartem

revusky
Thanks for your kind words. I know exactly how you feel. I'm glad I tripped over Javacc21 and your humorous narratives. 😃

And now for something completely different...

FNul : [F0] [F1] [F2] [F3] [F4];
Fs : F0 | F1 | F2 | F3 | F4 | FNul;
FsAlt1 : => ( F0 | F1 | F2 | F3 | F4 );
FsAlt2 : ( F0 | F1 | F2 | F3 | F4 ) =>||;
FsAlt3 : F0 =>|| | F1 =>|| | F2 =>|| | F3 =>|| | F4 =>||;

F0 : "one" | "two" | "three" | "four" | FAIL;  
F1 : "one" | "two" | "three" | "four" | => FAIL ASSERT ~("five") | "five";   
F1alt : "one" | "two" | "three" | "four" | => ASSERT ~("five") FAIL | "five"; 
F2 : "eeny" | FAIL | "meany" | "miny" | "moe"; 
F3 : "eeny" | SCAN {false} => FAIL | "meany" | "miny" | "moe";
F4 : "eeny" | SCAN {false}# => FAIL | "meany" | "miny" | "moe";

revusky

adMartem Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal

Well, yeah, if that's more comprehensible to you, given the way your brain works... (everybody is wired a bit differently, I suppose...) Though, actually, looking at what you wrote, I don't quite see the "recursively" part. We're actually not recursing, we're just going one level deep and that's it. Though, reading further, it seems that you understand that perfectly well.

And, as for:

adMartem And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance. I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

Well, yes, this is the way it should work (If I understand what you're saying...) And that's how it will work, but there are currently some issues that need to be addressed, and I guess I'll have to explain that separately.

But, anyway, the specification that is outlined (and I think now is basically implemented correctly) as regards up-to-here in non-terminals, that's not absolutely written in stone yet, I guess. There are a set of things that could be open for discussion, but hopefully, we'll consider it resolved once the Congo rebranding transition is done.

adMartem I can live with that.

Well, I think it's a reasonable, pragmatic approach. Basically, a SCAN or up-to-here only applies in the expansion where it appears and the first nonterminal in a sequence is an exception, and even then, only one nesting level deep.

Well, there are also a few little details wrt parentheses solely used for grouping. If we have:

   (Bar Baz)#BarBaz Bat 
   |
   SomethingElse

then we will respect an up-to-here in Bar. The parentheses around Bar Baz exist for grouping and affect tree-building, for example, but when it comes to up-to-here, it's the same as if it was just: Bar Baz Bat.

Well, I'll write it up, I guess.

adMartem

revusky
Yes, I tend to agree regarding unit tests. Thanks for fixing the spurious warnings and errors. I had them all over my grammar, even after the previous improvement/fix. The reason (they were so plentiful) is that I have several places that use the pattern NonTerminal : SCAN {someCondition}# => CobolWord; which, of course, is not at a decision point, so, after the change was made to more strictly enforce the #1 rule these stopped working. My solution was to turn them into choices like: NonTerminal : SCAN {someCondition}# => CobolWord; | FAIL. This caused lots of the warnings and, in some cases, errors to occur. My little test sample was something I had done to try and see the effect of the interaction of ASSERT, FAIL, and semantic predicates for another purpose, but I noticed it turned into pure warnings and errors when I happened to run it along with some other tests.

Interestingly (or maybe not), in order to get rid of the hard errors I had when the false detection appeared to preclude subsequent choices, I looked at the code and it seemed like the problems stemmed from the fact that FAIL was an EmptyExpansion, and, as such, it returned true to isPossiblyEmpty(). I decided to make a one-line addition:
public boolean isPossiblyEmpty() {return false;} to the Failure INJECTion. It got rid of the errors and warnings, and nothing in my grammar seemed to be broken. Just out of curiosity, is there ever a reason for it to return true?

adMartem

revusky
As I recall, I used "recursively" because I think I noted that if you have:

A : B | "e" "c";
B : D;
D : "e" "f" =>||;

the up-to-here is effectively applied in the first choice of the A production resulting in acceptance of the input "e c" via the second choice. But maybe that has changed since I thought I noticed it. In any case, "recursive" is probably not the way to describe it. It was just what was in my head. I assume

A : B | "e" "c";
B : D "g";
D : "e" "f" =>||;

would fail (to accept "e c") as it would choose B (seeing "e") and then fail to find "f".

adMartem

Part of my confusion, and now the cause of an issue I've been trying to track down, is the following behavior.

A : C ASSERT ~("d") | CD =>||;
CD : "c" "d";
C : SCAN 1 {true}# => "c";

In this example, the lookahead applied at the 1st choice of A first checks the predicate in C, as expected. But it also sets LookaheadRemaining to 1 (essentially overriding the higher-level up-to-here in A). After checking C, the LookaheadRemaining is 0, and so the scan succeeds and returns without checking the ASSERT condition. Is this expected behavior (a lower-level SCAN n ... will reduce an overall UNLIMITED scan to n), or is this a bug?

revusky

adMartem

Well, it seems to me that the way it is behaving is correct -- at least based on the "spec" I set out above. In:

A : C ASSERT ~("d") | CD =>|| ;

As I see it, the most fundamental rule of JavaCC (21 anyway) is that in the absence of any indication to the contrary, we're just scanning ahead one token to resolve the decision at a choice point. So the first choice in the production A above is just going to scan the "c" and then jump out. So it won't hit the assertion. If you wrote:

  A : C ASSERT ~("d") =>||  | CD =>|| ;

then it would scan to the end and would check the assertion and when that fails, go into the next choice CD. OTOH, if you wrote:

 A: ASSERT {getToken(2).getType() != TokenType.D}# C || CD ;

then I guess that would work. It's always going to hit the assertion at that point because we haven't scanned ahead a single token yet. Well, I think the above would work, but I'm not recommending it. It's terribly ugly.

Now, I would grant one point about this. The first expansion above, where we're only scanning one token ahead, the second choice CD is unreachable. And it's not too hard to check for something like that at build-time and emit a warning. And, in fact, the legacy tool would warn about this. (Though the warnings it emitted tended to be hard to understand, and actually, IMHO, not even correct...) At some point, I tore out all that code, in large part because it was just written so horribly, and then I figured I'd put some of the warnings back in later (but I never did.) In general, as long as it's single-token lookahead, it's not too hard to warn about such things. Of course, as I point out here regular programming languages do not warn in analogous situations. If you have:

      if (x>0) {
            ...
      }
      else if (x>1) {
           ...
     }

the code in the else-if block is pretty obviously unreachable and nobody considers this a big deal. (Of course, that might be because of the theoretical possibility that some other thread could change x in the nanosecond or so between the first check and the second one. That's obviously... well.... but it is a theoretical possibility, so....)

Still, in terms of this pure FIRST SET sort of stuff, I think there would be some value to warning that a choice is unreachable. Sure, why not? But one wrinkle in all this is that in the whole "context-free grammar" sort of framing, this problem is presented as an "ambiguity", and I honestly don't agree with that framing. IMHO, it is plainly obvious that a tool like this sequentially goes through the choices and takes the first one that matches. That there was a later listed choice that matches, in my mental model, does not constitute an "ambiguity". It's just how the tool works. It's not an "ambiguity", strictly speaking, but quite possibly worth warning about, because it probably is not going to do what the coder really intended... And funny enough, the lexer side is based on the idea that the first match wins. The string "for" matches the pattern for an identifier but also matches the pattern for the keyword "for". But we match the keyword because that is stated earlier and the earlier match wins. Again, nobody thinks that is a big deal. Of course, the string "forget" matches identifier, not the keyword "for" followed by the identifier "get" because we match as much input as we can, a.k.a. greedy matching. And I kinda think everybody beyond neophytes understands all this. (I mean without even studying the question.... it just works about the way one would expect!)

Well, another point about all this is that even if we had the dead code check, writing the C expansion this way:

 C : SCAN 1 {true}# => "c";

would probably cause the check to be short-circuited. You see, once you have the semantic predicate, which is Java code, the logic would just assume that it possibly returns false. It's not even going to do the minimal analysis to see that the prediate always returns true because it is the literal "true"! And once you assume that the next token can be "c" but the lookahead still fails because the Java code condition returns false, then we can't be sure that the second choice is dead code. (Though paradoxically, a human just eyeballing it can see that pretty easily, but since we never scan even the most minimally into any code that is expressed in Java. It's a black box in terms of our analysis.)

But anyway, I think the code snippet you provide is behaving correctly.

adMartem

revusky
Oops, I transcribed my example incorrectly. The first line should be:
A : C ASSERT ~("d") =>|| | CD =>|| ;. I think it results in the same generated logic as the other, however. The SCAN 1 appears to take precedence over the top-level up-to-here and sets lookaheadRemaining to 1 as the first thing in the "C" lookahead. This reflects the issue I had with my existing grammar. I just use the trivial predicates (e.g., "true") as place holders for what is actually a more significant predicate in the actual grammar, using, as you pointed out, the lack of dead code detection as a "feature" when writing test cases.

For now I have worked around the above issue by changing the "SCAN 1" to "SCAN" everywhere I use it. Then the generated parser seems to do what I intended. This construct was not a problem prior to your recent flurry of changes, but it probably worked before by accident, if the current behavior is correct.

I understand your points regarding the original (wrong) example I gave. My mistake highlights a similar error I make in the actual grammar. Namely, forgetting to type the up-to-here marker in cases like this. Luckily that is a one time thing (as I inexorably close in on the CongoCC grammar successfully parsing all of my functional tests). I doubt that it would occur often with anyone developing something from scratch. Most of the typographical errors are caught by the CongoCC grammar and do not become behavioral (such as habitually including a space between "=>" and "||").

revusky

adMartem

adMartem The SCAN 1 appears to take precedence over the top-level up-to-here and sets lookaheadRemaining to 1 as the first thing in the "C" lookahead.

Oh, well, if that's the case, that's another bug! I honestly don't know if that was there before, or these things are a result of me trying to "clean up" all the code and then... Well, stay tuned...

revusky

adMartem

Okay, try again. I think this addresses the problem.

You know, I think we're getting to a certain key point in this iterative process now. (At least for me...) It feels like pattern recognition is kicking in and it dawned on me today that basically all these issues we're running into are glitches in the basic logic of deciding when an expansion uses unlimited or indefinite scanahead, i.e. basically just scanning until it hits an up-to-here (or just to the very end of the expansion) as opposed to lookaheads that are bounded by the number of tokens, i.e. using single-token (FIRST SET) OR a lookahead of some n number of tokens.

So getting this completely in order (and I think we're almost there actually) amounts to consolidating and clarifying the above basic issue, so we don't have numerical lookahead when we should have indefinite lookahead and vice versa. Well, I guess I'm mostly talking to myself in the above, but sometimes that can be useful...

Anyway, it ought to be working correctly now.