SCAN question

adMartem

I have the following two productions in my grammar:

Device# : 
  <TAPE> | <INPUT_OUTPUT> | <OUTPUT> | <INPUT> | <DISPLAY> |  DeviceName =>||
;

DeviceName# :
      <MERGE>           // Sort-merge storage device
    | <SORT>            // Sort-merge storage device
    | <SORT_MERGE>      // Sort-merge storage device
    | <TAPE>            // Input-output device or file
    | SCAN 0 {isDeviceName()}# => <COBOL_WORD>
;

The generated code seems to use the first set of the DeviceName non-terminal as its only predicate in the Device production, even though the COBOL_WORD is conditioned by the semantic rule (which is only used as a predicate after DeviceName is chosen). Is this expected behavior and, if so, how should I construct this to do what I want? I guess I could move the semantic rule to the Device level, but that seems kind of ugly since it puts the DeviceName-specific logic in the wrong place.

revusky

adMartem

Hi. Sorry to be a bit slow getting back to you on this. The last week or so has been rather unusual though. On Monday, I flew back to Spain from England, and between my last few days in England and then getting home and having other things to deal with...

Well, anyway, it seems like what you are running into is a bug. Basically, here:

Device# : 
   <TAPE> | <INPUT_OUTPUT> | <OUTPUT> | <INPUT> | <DISPLAY> |  DeviceName =>||
;

It is not generating a full scanahead predicate for the DeviceName=>|| expansion and clearly it should.

Basically, one could say that if you're at a choice point, the logic of whether to go into that sub-expansion is either:

Just LL(1), i.e. just checking if the next token is in the expansion's "first set"
An actual generated predicate routine that is more sophisticated than that.

The problem is that it's just using the first type of logic when it should be doing the second one. It's reasoning that the expansion in question consumes exactly one token so it should use the first set logic, even though there is semantic predicate sort of logic inside the expansion that should be taken into account.

Actually, properly considered, this is a rather glaring bug. And actually, I was just playing with this a little and noticed that, for example, if you rewrite the DeviceName production like:

DeviceName# :
     <MERGE>           // Sort-merge storage device
    | <SORT>            // Sort-merge storage device
    | <SORT_MERGE>      // Sort-merge storage device
    | <TAPE>            // Input-output device or file
    | SCAN 0 {isDeviceName()}# => <COBOL_WORD> [<DISPLAY>] // Added an optional token
 ;

then since the DeviceName production potentially consumes 2 tokens, it generates the appropriate code. (You could try it.) The problem is that, as it was written before, always consuming exactly one token, it (incorrectly) always just used the first set logic.

There was another bug very similar to this reported a month ago, and it was based on a similar problem. See here: https://github.com/javacc21/javacc21/issues/188#issuecomment-1138926324

But that was resolved somehow but there is still this problem with what you are writing here. Well, I'll look into this and fix it shortly. Stay tuned...

adMartem

BTW, I tried SCAN {isDeviceName()}# => <COBOL_WORD> and SCAN 0 {isDeviceName()}# => <COBOL_WORD> | FAIL but they did the same thing.

adMartem

👍️

revusky

adMartem

Give it a try now. I think it's okay.

adMartem

revusky I thought it worked when I tried it, but then when I backed out my workaround in the big grammar, it still seemed to fail. I think it can be reduced it to a test case that reflects the usage in my grammar, to wit:

FooBarMaybeBaz :
    "foo"
   ( NT_Bar =>|| [ NotBaz =>|| | "baz" ] | "foo" "bar" [ "baz" ] {System.err.println("shouldn't get here with 'foobar(baz)?'!");} )
;
NotBaz :     
    ( SCAN {false}# => "xyzzy" | SCAN {false}# => "baz" ) {System.err.println("shouldn't get here with 'foobarbaz'!");}
;
NT_Bar :
    NotBar =>|| | "bar"
;
NotBar : 
    ( SCAN {false}# => "bar" | FAIL ) {System.err.println("shouldn't get here with 'foobar(baz)?'!");} 
;

I notice that it seems to generate the correct lookahead for NotBar (and NotBaz), but loses track of the predicate in the NT_Bar expansion. Unless I have lost my mind, which is certainly possible after looking at this for a few hours ;-)

This is using e0dfb24e.

P.S. this is the actual production above the Device production previously that continued to fail:

AssignClause :
  <ASSIGN> [ <TO> ] [ <DYNAMIC> | <EXTERNAL > ] 
  ( 
    Device =>|| [ [ <USING> ] ( QualifiedDataName | Literal ) ] 
    |
    [ <USING> ] ( QualifiedDataName | Literal ) =>||
    |
    (
        (
           ( SCAN 0 { isContextSensitiveWord("disk", "disc") }# => <COBOL_WORD> )
           |
           ( <LINE> <ADVANCING> | [ <MULTIPLE> ] ( < REEL> | <UNIT> ) )
        )
        [ <FILE> ] ( QualifiedDataName | Literal )
    ) =>||
  )
;

revusky

adMartem I notice that it seems to generate the correct lookahead for NotBar (and NotBaz), but loses track of the predicate in the NT_Bar expansion. Unless I have lost my mind, which is certainly possible after looking at this for a few hours ;-)

No, I don't think you've lost your mind! You're right that it wasn't generating the predicate for NT_Bar expansion, when it should. Update and try again and report back.

adMartem

revusky
It fixed the test case 😀 , but not the real grammar 🙁. Here is another test case that more closely mimics the actual grammar and reveals (another?) problem:

AssignClause :
  "assign" [ "to" ] [ "dynamic" | "external" ] 
  ( 
    Device =>|| 
    |
    [ "using" ] "blah"
  )
;

Device : 
  "tape" | "display" | DeviceName =>||
;

DeviceName :
      "merge"
    | SCAN {false}# => "plugh"
;

You will notice the scan for the Device non-terminal in AssignClause tests only the first-set without the predicate, however, the DeviceName non-terminal in Device scan checks the predicate as expected.

adMartem

adMartem
The previous fix did fix several other actual problems I had not yet fully diagnosed 😄 !

revusky

adMartem

Okay... I think this should be working now.

Now, since I suspect (not sure...) that you might be interested in (gradually learning how the code works, I'm going to explain what is going on here. Regardless, the exercise of explaining it is useful to me and also there might be some people lurking who are interested in this. And actually, I think Vinay could be interested for sure. But anyway, here goes...

Pretty much the entirety of the problems you've been hitting have to do with a certain optimization that was not entirely correct. Basically, the code generation machinery would try pretty hard to replace certain scanahead code with single-token-lookahead logic. Precisely where this happens is here:

https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl#L323

and here also:

https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl#L481

But mostly it's the first line. So you see this code in the template, like line 323, which basically if expansion.singleToken is true, it short-circuits all the cases below that and just generates single-token logic. And, as you can surmise, that was the key point which was causing the issues you were reporting.

Now, the expansion.singleToken actually calls this method in the Java code:

https://github.com/javacc21/javacc21/blob/master/src/java/com/javacc/core/Expansion.java#L457

Basically, I cut the Gordian knot by having this just return false, i.e. disabling the optimization. I was actually surprised that this made almost no difference to performance -- at least on very significant grammars that I ran some tests with, Java and Python specifically. In fact, the optimization was effectively (to my surprise) almost entirely a space optimization. It just generated smaller code, but not faster. So, for example, the JavaParser.java generated by the Java grammar is about 10% larger with this optimization turned off.

I may gradually turn on the optimization again, but in a very careful manner!

adMartem

revusky
As expected, it works now. Thanks for the explanation, I did find it interesting and informative.

I have also found counter-intuitive Java performance effects similar to that which you encountered. For example, I once went on a mission to "improve" some generated Java by eliminating sequences of tiny little methods that eventually reached a method that actually did something by in-lining the low-level functional code in the original top-level invoker. Unexpectedly, after a lot of work, I discovered that it slowed down everything significantly. In that case, the issue was that the tiny methods were quickly denatured by Hotspot and inlined in native code, but when I moved the functional code to the actual point of usage it pushed those methods beyond Hotspot's JIT compilation threshold, so nothing was inlined or compiled to native. Apparently Hotspot loves tiny methods.