A problem (revealed by) ASSERT in lookahead

revusky

Okay, well, I think this is working now. It should be working as I described in the last message.

Well, I've come across some other things that were not really working correctly, but I think the issue you raised above is resolved. (Though please check...)

adMartem

revusky
There seems to be a regression (or a correct sanity check that I fail). This grammar illustrates the error message in question.

FooOrBar : Foo | => Bar;
Foo : SCAN 1 "foo";
Bar : SCAN 1 "bar" ;

MaybeFooThenBar : [ Foo =>|| ] Bar ;

In the real grammars, it seems to object whenever there is a numerical scan nested within an up-to-here in a higher level expansion as far as I can tell.

revusky

adMartem
Oops, that's a mistake I made. I added a sanity check that is well-founded in principle: it will complain if you write something like:

SCAN 2 Foo Bar =>|| Baz

because the explicit numerical lookahead of 2 and the up-to-here should not be in the same expansion. The bug was that the check also picked up the numerical lookahead from the initial nonterminal (Foo in this case) whenever Foo starts the expansion. So it should be perfectly permissible to have:

Foo Bar => || BazThat,

and to have:

Foo: SCAN 2 "foo" "bar" "baz";

The effect should just be that the SCAN 2 is ignored because we have an up-to-here in the outer expansion which overrides it. If the up-to-here was not in the calling expansion, i.e. just:

 Foo Bar Baz

then the SCAN 2 in Foo would be used, i.e. we check for "foo" followed by "bar" to enter the expansion.

And, of course, if there is no numerical lookahead or up-to-here anywhere, we just look ahead one token, which is to check for "foo".
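
The precedence described above (outer up-to-here first, then the SCAN in the leading nonterminal, then the one-token default) can be sketched as a tiny decision function. This is only a toy model of the rule as stated in this post; the class, method, and parameter names are made up for illustration and are not part of CongoCC.

```java
// Toy model only: these names are hypothetical, not CongoCC's API.
final class LookaheadChoice {
    /** Sentinel meaning "scan up to the =>|| marker" instead of a fixed token count. */
    static final int UP_TO_HERE = -1;

    /**
     * @param outerHasUpToHere true if the calling expansion contains =>||
     * @param nestedScan       numeric SCAN declared in the leading nonterminal, 0 if none
     * @return the lookahead that decides whether the expansion is entered
     */
    static int effectiveLookahead(boolean outerHasUpToHere, int nestedScan) {
        if (outerHasUpToHere) {
            return UP_TO_HERE;   // the outer =>|| overrides the nested SCAN
        }
        if (nestedScan > 0) {
            return nestedScan;   // e.g. SCAN 2 declared inside Foo
        }
        return 1;                // default: look ahead a single token
    }

    public static void main(String[] args) {
        System.out.println(effectiveLookahead(true, 2));   // Foo Bar =>|| Baz
        System.out.println(effectiveLookahead(false, 2));  // Foo Bar Baz, Foo has SCAN 2
        System.out.println(effectiveLookahead(false, 0));  // no lookahead specified anywhere
    }
}
```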

This is all the result of thinking somewhat hard about how things should work in terms of common-sense, principle-of-least-surprise sorts of considerations. There's that, and just economy of expression, DRY. If you have the lookahead in the Foo production above, you don't need to put LOOKAHEAD(2) in front of every expansion that starts with Foo; it's specified in one place. But anyway, the sanity check was expressed incorrectly and should now be fixed. You can update and try it.

adMartem

revusky
Yep, that fixed it. Thanks. Now, however, I get a host of new warnings of the form "The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention." I understand what it is trying to tell me, but in the case of the grammar the non-terminal at the choice point is like this one:

MnemonicNameReference :
    SCAN {isMnemonicName(getNextToken())}# => CobolWord | FAIL
;

Is it not true that the FAIL alternative should suppress the aforementioned warning (since it cannot choose the non-terminal unless there is a CobolWord present)? And just now, looking at this, it would seem the FAIL is unnecessary and the semantic predicate should also prevent this from ever matching the empty string, right?

adMartem

adMartem
Actually, I tried it without the FAIL and it didn't give the warning. In one case it was an error message (couldn't match the following expansions) and that too went away. So it seems the FAIL is what is causing the problem. I would guess the message appears because the FAIL alternative causes the non-terminal to always be chosen at a decision point even though it may not consume any tokens, and the check does not realize that the FAIL will never be chosen because of implicit lookahead failure in that case.

revusky

adMartem
Well, a FAIL doesn't consume any tokens, but this is a spurious warning, I think. I guess I subtly rewrote certain things and now it's giving this spurious warning in these spots. Well, I have to look at this.

Well, just bear with me. We'll just gradually squash all these little bugs, not to worry.

revusky

adMartem
Well, the problem was that it was warning about issues like:

(Foo)*

when Foo can match empty input so you're going to get into an infinite loop. But in a case like:

Foo : "bar" | FAIL;

that is a spurious warning. If you had (Foo)* and Foo was:

  Foo : "bar" | ["baz"] ;

so it potentially matches empty input, then having it inside a repeating (....)* is problematic, because Foo always succeeds and you get an infinite loop...
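
To make the distinction concrete, here is a minimal sketch of a "possibly empty" check over a toy expansion tree. It assumes that FAIL counts as not possibly empty (it never succeeds, so it can never succeed without consuming input); that treatment is an assumption of this sketch. All types here are hypothetical and do not mirror CongoCC's internal classes.

```java
// Toy model only: hypothetical types, not CongoCC's internal classes.
final class EmptyCheck {
    interface Exp { boolean isPossiblyEmpty(); }

    /** A terminal such as "bar": always consumes one token. */
    static final class Tok implements Exp {
        public boolean isPossiblyEmpty() { return false; }
    }

    /** FAIL: never succeeds, so (by assumption here) it is never an empty match. */
    static final class Fail implements Exp {
        public boolean isPossiblyEmpty() { return false; }
    }

    /** [ ... ]: may be skipped entirely, so it can match empty input. */
    static final class Opt implements Exp {
        final Exp inner;
        Opt(Exp inner) { this.inner = inner; }
        public boolean isPossiblyEmpty() { return true; }
    }

    /** a | b | ...: possibly empty if any alternative is. */
    static final class Choice implements Exp {
        final Exp[] alts;
        Choice(Exp... alts) { this.alts = alts; }
        public boolean isPossiblyEmpty() {
            for (Exp alt : alts) {
                if (alt.isPossiblyEmpty()) return true;
            }
            return false;
        }
    }

    /** (body)* deserves a warning when the body can succeed without consuming input. */
    static boolean warnOnLoop(Exp body) {
        return body.isPossiblyEmpty();
    }

    public static void main(String[] args) {
        Exp barOrFail = new Choice(new Tok(), new Fail());          // "bar" | FAIL
        Exp barOrOptBaz = new Choice(new Tok(), new Opt(new Tok())); // "bar" | ["baz"]
        System.out.println(warnOnLoop(barOrFail));
        System.out.println(warnOnLoop(barOrOptBaz));
    }
}
```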

Well, anyway, try it again. I think it's okay now!

adMartem

revusky
I think I still get the warnings and one error on the following:

...
AdvancingPhrase :
  ( <BEFORE> | <AFTER> ) [ <ADVANCING> ]
  ( <PAGE>
  | MnemonicNameReference =>||
  | ( Identifier =>|| | IntegerConstant | NumericFigurativeConstant) [( <LINE> | <LINES> )]
  | <TO> [ <LINE> ] (Identifier =>|| | IntegerConstant) [ [<ON>] <NEXT> <PAGE>] //TODO: implement this
  )
;
...

The (error) message is Error: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7974:5:This expansion can match the empty string.The following 2 expansions can never be matched.
The error goes away if I remove the FAIL, but the rest of the warnings on other expansions in different productions remain.
I double-checked that I was building from the latest pull from Javacc21 [5d66d704].

adMartem

Here's another one (more exciting than the previous one).
I have the following snippet in the grammar:

...
CombinableCondition :
    SimpleCondition =>|| | AbbreviatedRelationCondition =>|| | <LPARENCHAR> ASSERT ~(AbbreviatedCondition <RPARENCHAR> ArithmeticOperator) AbbreviatedCondition <RPARENCHAR> =>||
;
AbbreviatedRelationCondition :
    (   
            RelationalOperator ArithmeticExpression
        |   [ <NOT> ] RelationalOperator ArithmeticExpression
        |   [ <NOT> ] ArithmeticExpression =>||
//      |   ZERO/ZEROS/ZEROES shadowed by ArithmeticExpression()
        |   SignCondition =>||
    )
;

RelationalOperator :
(
    [ <IS> ] [ <NOT> ]  
        ( 
            SCAN 3 =>   <GREATER> [ <THAN> ] <OR> <EQUAL> [ <TO> ]
        |               <MORETHANOREQUAL>
        |   SCAN 3 =>   <LESS> [ <THAN> ] <OR> <EQUAL> [ <TO> ]
        |               <LESSTHANOREQUAL>
        |               <GREATER> [ <THAN> ]
        |               <MORETHANCHAR>
        |               <LESS> [ <THAN> ]
        |               <LESSTHANCHAR>
        |               (<EQUAL>|<EQUALS>) [ <TO> ]
        |               <EQUALCHAR>[ <TO> ]
        |               <NOTEQUALCHAR>
        |   SCAN {allowJas()}# => <JAS_NE>
        |   SCAN {allowJas()}# => <JAS_EQ>
   )
) =>||
;
...

The input string looks like this : NOT 10 AND 9 AND = 10 ... at the point that CombinableCondition is entered.
What happens is that the AbbreviatedRelationCondition up-to-here scan works fine and passes over the first two choices and succeeds on the third (correct) choice. Then when the AbbreviatedRelationCondition is entered it correctly skips the first choice but (incorrectly) selects the second one based on the first set rather than scanning.

I will try and reduce this to a test case if you need it, but I thought I would let you know right away with this fragment in hopes it is sufficient.

adMartem

adMartem
The funky first two choices are due to the (legal) syntax of "NOT NOT EQUAL 10" in the context of this production. Ugh!

adMartem

adMartem I'm beginning to think this is my (brain's) problem, perhaps masked in earlier CongoCC versions. Is it reasonable to assume that the lookahead will succeed at the same point as the selected non-terminal, or is it the case that I should have resolved the problem with an up-to-here in the 2nd choice of AbbreviatedRelationCondition:
... | [ <NOT> ] RelationalOperator =>|| ArithmeticExpression? I.e., my up-to-here scan was at too high a level.
... ( a little later) ...
Now I'm sure I was wrong-headed when I assumed the behavior I originally described. Short of memoization of the scan to make expansion choices always consistent with lookahead I don't see how it could be implemented the way I assumed. So now the mystery is how it ever worked that way (which it did). I'll have to go back and see what was generated before.

adMartem

adMartem
This is typical of the remaining warnings:

...
WriteCatena :
  RecordName [ <FROM> (Identifier =>|| | Literal) ]
  [ AdvancingPhrase ]
  [ [ At =>|| ] ( <END_OF_PAGE> | <EOP> ) =>|| StatementList ]
  [ <NOT> [ At =>|| ] ( <END_OF_PAGE> | <EOP> ) =>|| StatementList ]
  [ <_INVALID> [ <KEY> ] StatementList ]
  [ <NOT> <_INVALID> [ <KEY> ] StatementList ]
  [ <END_WRITE> ]
;
...
At :
    SCAN {isContextSensitiveWord("at")}# => CobolWord | FAIL
;
...

Warning: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7964:5:The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention.
Warning: /Users/jmb/Development/Local_Repositories/p3cobol/src/main/congocc/p3cobol.ccc:7965:11:The expansion inside this (...)? construct can be matched by the empty string so it is always matched. This may not be your intention.
These occurred at the "At" non-terminal.
When I remove the FAIL the error and warnings all go away.
I guess I probably don't need the up-to-here on the At reference since without the FAIL the SCAN will still be allowed. When I did these I was under the impression that I had to create a choice point in order to add the predicate.

revusky

adMartem

Well, I think there is still a bug in the logic for that warning. I have to look at this more closely. When the final choice in a choice construct is FAIL, then...

Well, not to worry... we'll get this stuff right. In any case, that it's only a warning means that you can disregard it. But the logic of this needs to be fine-tuned.

I do have to say that it is great to have somebody really using all these things in praxis. (Besides the project internally, that is...) Because that really is about the only way to get all this stuff working right.

Well, one aspect of this (that you surely realize) is that the language for expressing the grammar (meta-language to be pretentious...) in Congo/JavaCC21 is really vastly more powerful and expressive than what there is in the original JavaCC. So it is much harder to get everything absolutely right and probably, as a practical matter, the only way to do it is to have noisy, demanding end-users. (Like you.)

revusky

adMartem

I think this is fixed. It was a subtle bug in the sanity check. There is this general problem that the sanity check stuff is meant to catch buggy code, but if the sanity check itself is buggy... I guess that's also a paradox of unit tests and all that. Sure, it's a good idea, you can catch regressions and so on, but if the test itself is buggy....

Though it's maybe a tangent... I myself don't believe in unit tests that much, because I tend to find that if a system is sufficiently complex, the bugs tend to manifest themselves in the conjunction of more than one feature. So unit testing each feature individually can give one a false sense of confidence. And, in any case, I would put more stock in full functional tests than unit tests. We have at least 4 pretty major functional tests of the system, which are the Java, Python, CSharp grammars, and the rebuild/retest of the tool itself, which is written in itself!

Of course, you're hitting these bugs because you are using combinations of things that are not used in the aforementioned functional tests.

revusky

Well, I think the problem you're running into (or maybe it's just one of them) is that I changed (thinking I could get away with it) the way it works as regards using any scanahead specified in a non-terminal.

The way it was before, if you wrote:

   A B C
   |
   D E F

and let's say that B contains an up-to-here, that would be used as long as the preceding expansions were potentially empty, i.e. consumed no tokens. Potentially. So, A could be:

 A: ["foo" | "bar" | "baz"];

which is potentially empty. Or if the first expansion in the choice above was:

   [A] B C

which amounts to the same thing...

The way it's implemented now, the elements before the nonterminal (say B in this case) must consume no tokens. Since [A] is potentially non-empty, then any up-to-here in B is ignored. But, in principle, you can still have:

     ASSERT {condition1()} {doSomething()} B C 
     |
      ....

And it would use the up-to-here in B, because the elements preceding B do not consume any tokens. (Granted, the code block that is second in the sequence could explicitly call consumeToken() but that's getting entirely too tricky. We do just assume that a Java code block does not consume any input.)
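
The difference between the old and the new rule can be sketched by classifying each element that precedes the nonterminal. This is a toy model of the two rules as described in this post, not the code in NonTerminal.java; the enum and method names are invented for illustration.

```java
// Toy model only: hypothetical names, not the logic in CongoCC's NonTerminal class.
final class HoistRule {
    enum Kind {
        NEVER_CONSUMES,   // ASSERT {...}, a Java code block (assumed not to consume input)
        POSSIBLY_EMPTY,   // [A], or a nonterminal like A : ["foo" | "bar" | "baz"];
        ALWAYS_CONSUMES   // anything that must match at least one token
    }

    /** Old rule: every preceding element is at least potentially empty. */
    static boolean oldRule(Kind... preceding) {
        for (Kind k : preceding) {
            if (k == Kind.ALWAYS_CONSUMES) return false;
        }
        return true;
    }

    /** New rule: every preceding element consumes no tokens at all. */
    static boolean newRule(Kind... preceding) {
        for (Kind k : preceding) {
            if (k != Kind.NEVER_CONSUMES) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // [A] B C : the up-to-here in B was used before, and is ignored now.
        System.out.println(oldRule(Kind.POSSIBLY_EMPTY) + " " + newRule(Kind.POSSIBLY_EMPTY));
        // ASSERT {condition1()} {doSomething()} B C : used under both rules.
        System.out.println(oldRule(Kind.NEVER_CONSUMES, Kind.NEVER_CONSUMES) + " "
                + newRule(Kind.NEVER_CONSUMES, Kind.NEVER_CONSUMES));
    }
}
```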

But anyway, the way it was expressed before was that the things preceding it potentially consumed no input. And I surely was thinking about this at some point. I was probably thinking in terms of constructs like:

    Modifiers TypeDefinition

where Modifiers (public, private, static etc.) is potentially empty so maybe the up-to-here is in the TypeDefinition. So there may well be a use-case for this (though none of my internal use was using this).

But finally (very recently) I decided that this was possibly a bit too tricky (not so much to implement as to just document!) and figured that I could get away with changing this so that the nonterminal has to be the first non-empty sub-expansion in the sequence. I knew this was changing behavior but considered it unlikely that it would affect anybody and also I figured that I could get away with doing this now.

And if you really want to get dirty with the details a bit, this is where this is implemented: https://github.com/javacc21/javacc21/blob/master/src/java/com/javacc/core/NonTerminal.java#L67

So, the current "spec" is that the up-to-here (or SCAN) in a NonTerminal is used if:

 1. The NonTerminal in question is the first non-empty sub-expansion in the sequence
 2. There is no up-to-here (or SCAN) in the enclosing sequence that would have priority.
 3. We're not more than 1 nesting level deep in terms of calling non-terminals or sub-expansions

It could be worth noting that points 1 and 2 are determined at build-time, while point 3 is at run-time, when the parser is actually being run. (Worth noting if you want to develop a conceptual model of how the thing actually works...)

Anyway, the question now is basically:

Could you live with the above semantics?

adMartem

revusky
I think I see what you are saying. Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal if and only if:

The NonTerminal in question is the first non-empty sub-expansion in the enclosing sequence
There is no up-to-here (or SCAN) in the enclosing sequence that would have priority.

Additionally, when processing the grammar with actual input:

The parser is no more than 1 nesting level deep in terms of accepting non-terminals or sub-expansions.

Is that correct? And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance? I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

I can live with that.

The metaphysical problem I have with up-to-here is coming up with a way to think about it while writing productions. But that's my problem, I guess.

Finally, I would assume from this that the correct way to refactor the snippet I gave would be:

CombinableCondition :
    SimpleCondition =>|| | AbbreviatedRelationCondition | <LPARENCHAR>  AbbreviatedCondition <RPARENCHAR> ASSERT ~(ArithmeticOperator) =>||
;
...
AbbreviatedRelationCondition :
    (   
            RelationalOperator ArithmeticExpression
        |   [ <NOT> ] RelationalOperator =>|| ArithmeticExpression
        |   [ <NOT> ] ArithmeticExpression =>||
//      |   ZERO/ZEROS/ZEROES shadowed by ArithmeticExpression()
        |   SignCondition =>||
    )
;
...

i.e., no up-to-here in CombinableCondition (unnecessary), up-to-here on 2nd choice in AbbreviatedRelationCondition (necessary even though RelationalOperator has up-to-here). 1st choice RelationalOperator needs no up-to-here, as it is hoisted to this sequence.

Also, am I correct in assuming that lookahead is independent of acceptance in that the sequence that is checked in lookahead is not guaranteed to be the sequence that is accepted after the choice is taken?

adMartem

revusky
Thanks for your kind words. I know exactly how you feel. I'm glad I tripped over Javacc21 and your humorous narratives. 😃

And now for something completely different...

FNul : [F0] [F1] [F2] [F3] [F4];
Fs : F0 | F1 | F2 | F3 | F4 | FNul;
FsAlt1 : => ( F0 | F1 | F2 | F3 | F4 );
FsAlt2 : ( F0 | F1 | F2 | F3 | F4 ) =>||;
FsAlt3 : F0 =>|| | F1 =>|| | F2 =>|| | F3 =>|| | F4 =>||;

F0 : "one" | "two" | "three" | "four" | FAIL;  
F1 : "one" | "two" | "three" | "four" | => FAIL ASSERT ~("five") | "five";   
F1alt : "one" | "two" | "three" | "four" | => ASSERT ~("five") FAIL | "five"; 
F2 : "eeny" | FAIL | "meany" | "miny" | "moe"; 
F3 : "eeny" | SCAN {false} => FAIL | "meany" | "miny" | "moe";
F4 : "eeny" | SCAN {false}# => FAIL | "meany" | "miny" | "moe";

revusky

adMartem Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal

Well, yeah, if that's more comprehensible to you, given the way your brain works... (everybody is wired a bit differently, I suppose...) Though, actually, looking at what you wrote, I don't quite see the "recursively" part. We're actually not recursing, we're just going one level deep and that's it. Though, reading further, it seems that you understand that perfectly well.

And, as for:

adMartem And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance. I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

Well, yes, this is the way it should work (If I understand what you're saying...) And that's how it will work, but there are currently some issues that need to be addressed, and I guess I'll have to explain that separately.

But, anyway, the specification that is outlined (and I think now is basically implemented correctly) as regards up-to-here in non-terminals, that's not absolutely written in stone yet, I guess. There are a set of things that could be open for discussion, but hopefully, we'll consider it resolved once the Congo rebranding transition is done.

adMartem I can live with that.

Well, I think it's a reasonable, pragmatic approach. Basically, a SCAN or up-to-here only applies in the expansion where it appears and the first nonterminal in a sequence is an exception, and even then, only one nesting level deep.

Well, there are also a few little details wrt parentheses solely used for grouping. If we have:

   (Bar Baz)#BarBaz Bat 
   |
   SomethingElse

then we will respect an up-to-here in Bar. The parentheses around Bar Baz exist for grouping and affect tree-building, for example, but when it comes to up-to-here, it's the same as if it was just: Bar Baz Bat.

Well, I'll write it up, I guess.

adMartem

revusky
Yes, I tend to agree regarding unit tests. Thanks for fixing the spurious warnings and errors. I had them all over my grammar, even after the previous improvement/fix. The reason they were so plentiful is that I have several places that use the pattern NonTerminal : SCAN {someCondition}# => CobolWord; which, of course, is not at a decision point, so after the change was made to more strictly enforce the #1 rule these stopped working. My solution was to turn them into choices like: NonTerminal : SCAN {someCondition}# => CobolWord | FAIL;. This caused lots of the warnings and, in some cases, errors to occur. My little test sample was something I had done to try and see the effect of the interaction of ASSERT, FAIL, and semantic predicates for another purpose, but I noticed it turned into pure warnings and errors when I happened to run it along with some other tests.

Interestingly (or maybe not), in order to get rid of the hard errors I had when the false detection appeared to preclude subsequent choices, I looked at the code and it seemed like the problems stemmed from the fact that FAIL was an EmptyExpansion, and, as such, it returned true to isPossiblyEmpty(). I decided to make a one-line addition:
public boolean isPossiblyEmpty() {return false;} to the Failure INJECTion. It got rid of the errors and warnings, and nothing in my grammar seemed to be broken. Just out of curiosity, is there ever a reason for it to return true?

adMartem

revusky
As I recall, I used "recursively" because I think I noted that if you have:

A : B | "e" "c";
B : D;
D : "e" "f" =>||;

the up-to-here is effectively applied in the first choice of the A production resulting in acceptance of the input "e c" via the second choice. But maybe that has changed since I thought I noticed it. In any case, "recursive" is probably not the way to describe it. It was just what was in my head. I assume

A : B | "e" "c";
B : D "g";
D : "e" "f" =>||;

would fail (to accept "e c") as it would choose B (seeing "e") and then fail to find "f".
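
The two behaviors being contrasted for the second grammar can be sketched with a hand-written recognizer. It is a toy model only: it hard-codes A : B | "e" "c"; B : D "g"; D : "e" "f" =>||; and a flag selects between plain one-token lookahead and a scan up to the =>|| in D. Whether CongoCC actually applies the up-to-here across these two levels is exactly the open question here, and the sketch does not settle it. All names are hypothetical.

```java
// Toy model only: a hard-coded recognizer, not code generated by CongoCC.
final class HoistDemo {
    /** True if tokens[pos..] begins with the expected sequence. */
    static boolean startsWith(String[] tokens, int pos, String... expected) {
        if (pos + expected.length > tokens.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (!tokens[pos + i].equals(expected[i])) return false;
        }
        return true;
    }

    /**
     * Recognizes A : B | "e" "c";  B : D "g";  D : "e" "f" =>||;
     *
     * @param hoistUpToHere if true, the first choice is entered only after
     *                      scanning "e" "f"; otherwise on seeing "e" alone
     */
    static boolean parseA(String[] tokens, boolean hoistUpToHere) {
        boolean chooseB = hoistUpToHere
                ? startsWith(tokens, 0, "e", "f")  // scan D up to its =>||
                : startsWith(tokens, 0, "e");      // default one-token lookahead
        if (chooseB) {
            // Committed to B; there is no backtracking into the second choice.
            return tokens.length == 3 && startsWith(tokens, 0, "e", "f", "g");
        }
        return tokens.length == 2 && startsWith(tokens, 0, "e", "c");
    }

    public static void main(String[] args) {
        String[] ec = {"e", "c"};
        System.out.println(parseA(ec, false)); // commits to B on "e", then misses "f"
        System.out.println(parseA(ec, true));  // scan fails, falls through to "e" "c"
    }
}
```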