A problem (revealed by) ASSERT in lookahead

adMartem

revusky
I think I see what you are saying. Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal if and only if:

The NonTerminal in question is the first non-empty sub-expansion in the enclosing sequence
There is no up-to-here (or SCAN) in the enclosing sequence that would have priority.

Additionally, when processing the grammar with actual input:

The parser is no more than 1 nesting level deep in terms of accepting non-terminals or sub-expansions.

Is that correct? And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance. I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

I can live with that.

The metaphysical problem I have with up-to-here is coming up with a way to think about it while writing productions. But that's my problem, I guess.

Finally, I would assume from this that the correct way to refactor the snippet I gave would be:

CombinableCondition :
    SimpleCondition =>|| | AbbreviatedRelationCondition | <LPARENCHAR>  AbbreviatedCondition <RPARENCHAR> ASSERT ~(ArithmeticOperator) =>||
;
...
AbbreviatedRelationCondition :
    (   
            RelationalOperator ArithmeticExpression
        |   [ <NOT> ] RelationalOperator =>|| ArithmeticExpression
        |   [ <NOT> ] ArithmeticExpression =>||
//      |   ZERO/ZEROS/ZEROES shadowed by ArithmeticExpression()
        |   SignCondition =>||
    )
;
...

i.e., no up-to-here in CombinableCondition (unnecessary), up-to-here on 2nd choice in AbbreviatedRelationCondition (necessary even though RelationalOperator has up-to-here). 1st choice RelationalOperator needs no up-to-here, as it is hoisted to this sequence.

Also, am I correct in assuming that lookahead is independent of acceptance in that the sequence that is checked in lookahead is not guaranteed to be the sequence that is accepted after the choice is taken?

revusky

adMartem Saying it a little differently,
an up-to-here is (recursively) effectively "hoisted" to an enclosing sequence containing its non-terminal

Well, yeah, if that's more comprehensible to you, given the way your brain works... (everybody is wired a bit differently, I suppose...) Though, actually, looking at what you wrote, I don't quite see the "recursively" part. We're actually not recursing, we're just going one level deep and that's it. Though, reading further, it seems that you understand that perfectly well.

And, as for:

adMartem And, if so, can I also assume that any explicit up-to-here in an expansion is always applied at that level of lookahead/acceptance. I.e., the previous rules only apply to "hoisted" up-to-here action, not explicit up-to-here notation.

Well, yes, this is the way it should work (If I understand what you're saying...) And that's how it will work, but there are currently some issues that need to be addressed, and I guess I'll have to explain that separately.

But, anyway, the specification that is outlined (and I think now is basically implemented correctly) as regards up-to-here in non-terminals, that's not absolutely written in stone yet, I guess. There are a set of things that could be open for discussion, but hopefully, we'll consider it resolved once the Congo rebranding transition is done.

adMartem I can live with that.

Well, I think it's a reasonable, pragmatic approach. Basically, a SCAN or up-to-here only applies in the expansion where it appears and the first nonterminal in a sequence is an exception, and even then, only one nesting level deep.

Well, there are also a few little details wrt parentheses solely used for grouping. If we have:

   (Bar Baz)#BarBaz Bat 
   |
   SomethingElse

then we will respect an up-to-here in Bar. The parentheses around Bar Baz exist for grouping and affect tree-building, for example, but when it comes to up-to-here, it's the same as if it was just: Bar Baz Bat.

Well, I'll write it up, I guess.

adMartem

revusky
Thanks for your kind words. I know exactly how you feel. I'm glad I tripped over Javacc21 and your humorous narratives. 😃

And now for something completely different...

FNul : [F0] [F1] [F2] [F3] [F4];
Fs : F0 | F1 | F2 | F3 | F4 | FNul;
FsAlt1 : => ( F0 | F1 | F2 | F3 | F4 );
FsAlt2 : ( F0 | F1 | F2 | F3 | F4 ) =>||;
FsAlt3 : F0 =>|| | F1 =>|| | F2 =>|| | F3 =>|| | F4 =>||;

F0 : "one" | "two" | "three" | "four" | FAIL;  
F1 : "one" | "two" | "three" | "four" | => FAIL ASSERT ~("five") | "five";   
F1alt : "one" | "two" | "three" | "four" | => ASSERT ~("five") FAIL | "five"; 
F2 : "eeny" | FAIL | "meany" | "miny" | "moe"; 
F3 : "eeny" | SCAN {false} => FAIL | "meany" | "miny" | "moe";
F4 : "eeny" | SCAN {false}# => FAIL | "meany" | "miny" | "moe";

revusky

adMartem

I think this is fixed. It was a subtle bug in the sanity check. There is this general problem that the sanity check stuff is meant to catch buggy code, but if the sanity check itself is buggy... I guess that's also a paradox of unit tests and all that. Sure, it's a good idea, you can catch regressions and so on, but if the test itself is buggy....

Though it's maybe a tangent... I myself don't believe in unit tests that much, because I tend to find that if a system is sufficiently complex, the bugs tend to manifest themselves in the conjunction of more than one feature. So unit testing each feature individually can give one a false sense of confidence. And, in any case, I would put more stock in full functional tests than unit tests. We have at least 4 pretty major functional tests of the system, which are the Java, Python, CSharp grammars, and the rebuild/retest of the tool itself, which is written in itself!

Of course, you're hitting these bugs because you are using combinations of things that are not used in the aforementioned functional tests.

adMartem

revusky
Yes, I tend to agree regarding unit tests. Thanks for fixing the spurious warnings and errors. I had them all over my grammar, even after the previous improvement/fix. The reason (they were so plentiful) is that I have several places that use the pattern NonTerminal : SCAN {someCondition}# => CobolWord; which, of course, is not at a decision point, so, after the change was made to more strictly enforce the #1 rule these stopped working. My solution was to turn them into choices like: NonTerminal : SCAN {someCondition}# => CobolWord; | FAIL. This caused lots of the warnings and, in some cases, errors to occur. My little test sample was something I had done to try and see the effect of the interaction of ASSERT, FAIL, and semantic predicates for another purpose, but I noticed it turned into pure warnings and errors when I happened to run it along with some other tests.

Interestingly (or maybe not), in order to get rid of the hard errors I had when the false detection appeared to preclude subsequent choices, I looked at the code and it seemed like the problems stemmed from the fact that FAIL was an EmptyExpansion, and, as such, it returned true to isPossiblyEmpty(). I decided to make a one-line addition:
public boolean isPossiblyEmpty() {return false;} to the Failure INJECTion. It got rid of the errors and warnings, and nothing in my grammar seemed to be broken. Just out of curiosity, is there ever a reason for it to return true?

adMartem

revusky
As I recall, I used "recursively" because I think I noted that if you have:

A : B | "e" "c";
B : D;
D : "e" "f" =>||;

the up-to-here is effectively applied in the first choice of the A production resulting in acceptance of the input "e c" via the second choice. But maybe that has changed since I thought I noticed it. In any case, "recursive" is probably not the way to describe it. It was just what was in my head. I assume

A : B | "e" "c";
B : D "g";
D : "e" "f" =>||;

would fail (to accept "e c") as it would choose B (seeing "e") and then fail to find "f".

adMartem

Part of my confusion, and now the cause of an issue I've been trying to track down, is the following behavior.

A : C ASSERT ~("d") | CD =>||;
CD : "c" "d";
C : SCAN 1 {true}# => "c";

In this example, the lookahead applied at the 1st choice of A first checks the predicate in C, as expected. But it also sets LookaheadRemaining to 1 (essentially overriding the higher-level up-to-here in A). After checking C, the LookaheadRemaining is 0, and so the scan succeeds and returns without checking the ASSERT condition. Is this expected behavior (a lower-level SCAN n ... will reduce an overall UNLIMITED scan to n), or is this a bug?

revusky

adMartem

Well, it seems to me that the way it is behaving is correct -- at least based on the "spec" I set out above. In:

A : C ASSERT ~("d") | CD =>|| ;

As I see it, the most fundamental rule of JavaCC (21 anyway) is that in the absence of any indication to the contrary, we're just scanning ahead one token to resolve the decision at a choice point. So the first choice in the production A above is just going to scan the "c" and then jump out. So it won't hit the assertion. If you wrote:

  A : C ASSERT ~("d") =>||  | CD =>|| ;

then it would scan to the end and would check the assertion and when that fails, go into the next choice CD. OTOH, if you wrote:

 A: ASSERT {getToken(2).getType() != TokenType.D}# C || CD ;

then I guess that would work. It's always going to hit the assertion at that point because we haven't scanned ahead a single token yet. Well, I think the above would work, but I'm not recommending it. It's terribly ugly.

Now, I would grant one point about this. The first expansion above, where we're only scanning one token ahead, the second choice CD is unreachable. And it's not too hard to check for something like that at build-time and emit a warning. And, in fact, the legacy tool would warn about this. (Though the warnings it emitted tended to be hard to understand, and actually, IMHO, not even correct...) At some point, I tore out all that code, in large part because it was just written so horribly, and then I figured I'd put some of the warnings back in later (but I never did.) In general, as long as it's single-token lookahead, it's not too hard to warn about such things. Of course, as I point out here regular programming languages do not warn in analogous situations. If you have:

      if (x>0) {
            ...
      }
      else if (x>1) {
           ...
     }

the code in the else-if block is pretty obviously unreachable and nobody considers this a big deal. (Of course, that might be because of the theoretical possibility that some other thread could change x in the nanosecond or so between the first check and the second one. That's obviously... well.... but it is a theoretical possibility, so....)

Still, in terms of this pure FIRST SET sort of stuff, I think there would be some value to warning that a choice is unreachable. Sure, why not? But one wrinkle in all this is that in the whole "context-free grammar" sort of framing, this problem is presented as an "ambiguity", and I honestly don't agree with that framing. IMHO, it is plainly obvious that a tool like this sequentially goes through the choices and takes the first one that matches. That there was a later listed choice that matches, in my mental model, does not constitute an "ambiguity". It's just how the tool works. It's not an "ambiguity", strictly speaking, but quite possibly worth warning about, because it probably is not going to do what the coder really intended... And funny enough, the lexer side is based on the idea that the first match wins. The string "for" matches the pattern for an identifier but also matches the pattern for the keyword "for". But we match the keyword because that is stated earlier and the earlier match wins. Again, nobody thinks that is a big deal. Of course, the string "forget" matches identifier, not the keyword "for" followed by the identifier "get" because we match as much input as we can, a.k.a. greedy matching. And I kinda think everybody beyond neophytes understands all this. (I mean without even studying the question.... it just works about the way one would expect!)

Well, another point about all this is that even if we had the dead code check, writing the C expansion this way:

 C : SCAN 1 {true}# => "c";

would probably cause the check to be short-circuited. You see, once you have the semantic predicate, which is Java code, the logic would just assume that it possibly returns false. It's not even going to do the minimal analysis to see that the prediate always returns true because it is the literal "true"! And once you assume that the next token can be "c" but the lookahead still fails because the Java code condition returns false, then we can't be sure that the second choice is dead code. (Though paradoxically, a human just eyeballing it can see that pretty easily, but since we never scan even the most minimally into any code that is expressed in Java. It's a black box in terms of our analysis.)

But anyway, I think the code snippet you provide is behaving correctly.

adMartem

revusky
Oops, I transcribed my example incorrectly. The first line should be:
A : C ASSERT ~("d") =>|| | CD =>|| ;. I think it results in the same generated logic as the other, however. The SCAN 1 appears to take precedence over the top-level up-to-here and sets lookaheadRemaining to 1 as the first thing in the "C" lookahead. This reflects the issue I had with my existing grammar. I just use the trivial predicates (e.g., "true") as place holders for what is actually a more significant predicate in the actual grammar, using, as you pointed out, the lack of dead code detection as a "feature" when writing test cases.

For now I have worked around the above issue by changing the "SCAN 1" to "SCAN" everywhere I use it. Then the generated parser seems to do what I intended. This construct was not a problem prior to your recent flurry of changes, but it probably worked before by accident, if the current behavior is correct.

I understand your points regarding the original (wrong) example I gave. My mistake highlights a similar error I make in the actual grammar. Namely, forgetting to type the up-to-here marker in cases like this. Luckily that is a one time thing (as I inexorably close in on the CongoCC grammar successfully parsing all of my functional tests). I doubt that it would occur often with anyone developing something from scratch. Most of the typographical errors are caught by the CongoCC grammar and do not become behavioral (such as habitually including a space between "=>" and "||").

revusky

adMartem

adMartem The SCAN 1 appears to take precedence over the top-level up-to-here and sets lookaheadRemaining to 1 as the first thing in the "C" lookahead.

Oh, well, if that's the case, that's another bug! I honestly don't know if that was there before, or these things are a result of me trying to "clean up" all the code and then... Well, stay tuned...

revusky

adMartem

Okay, try again. I think this addresses the problem.

You know, I think we're getting to a certain key point in this iterative process now. (At least for me...) It feels like pattern recognition is kicking in and it dawned on me today that basically all these issues we're running into are glitches in the basic logic of deciding when an expansion uses unlimited or indefinite scanahead, i.e. basically just scanning until it hits an up-to-here (or just to the very end of the expansion) as opposed to lookaheads that are bounded by the number of tokens, i.e. using single-token (FIRST SET) OR a lookahead of some n number of tokens.

So getting this completely in order (and I think we're almost there actually) amounts to consolidating and clarifying the above basic issue, so we don't have numerical lookahead when we should have indefinite lookahead and vice versa. Well, I guess I'm mostly talking to myself in the above, but sometimes that can be useful...

Anyway, it ought to be working correctly now.

adMartem

revusky
Indeed, it now works. On to some IF statement edge cases that seem to have cropped up. Likely my problem, not yours.

« Previous Page