Code actions and Semantic Lookahead now working for non-Java languages

revusky

I've actually had this working in principle for nearly a year, but somehow I got distracted and did not put in the necessary energy to polish it off -- the final mile, as it were...

As things have stood for a good while, CongoCC makes a pretty good attempt (frequently successful) to translate code snippets from Java to the language actually being generated. And actually, for minimal bits of code this can work quite well. In fact, frequently, there is no need for any translation even.

     SCAN {someFlag} => Foo Bar

There may be no need to translate someFlag, the simple name of a variable. Of course, the problem is that you really want to be able to put an arbitrarily complex expression in there and this can get tricky. While it is true that the expression grammar is quite similar between the various languages we are supporting -- Java, Python, C#, and now Rust -- it is not identical. A feature may exist in one language and not another, like interpolated strings (called f-strings) in Python, and one should be able to use the full language in whatever code snippet. That is what it means for the non-Java languages to be first class citizens.

The current solution to this is the raw code feature. You can write raw code with the syntax {R% (rust code here) %} or {C% (CSharp code here) %} etc. The raw code block is treated as a single token. It starts with the opening brace {, then there is a letter indicating the language, then a %, then the code, and it finally closes with %}.

I have to admit that these raw code snippets are visually uglier than the snippets delimited by simple left and right braces ({ and }) but the reason for the jarring %} to close the block is that we need a reliably distinctive character combination to scan to in order to identify the end of the block. The alternative would maybe involve scanning the block keeping track of the level of brace nesting, but that is quite fiddly. Since we are not even trying (at least in the initial pass) to tokenize what is inside the raw code block, let alone parse it, one would have to realize that the character combination %} cannot occur inside the code block -- for example in a string literal or comment. Well, in short, a raw code snippet is a sort of black box.
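To make the "scan to the first %}" idea concrete, here is a minimal Python sketch. It is not the actual CongoCC lexer; the names `RawBlock` and `scan_raw_block` are made up for illustration, and I am assuming the language marker is a single uppercase letter.

```python
import re
from typing import NamedTuple, Optional


class RawBlock(NamedTuple):
    lang: Optional[str]  # language letter such as "J" or "R"; None if omitted
    code: str            # everything between the opening marker and the closing %}
    end: int             # index just past the closing %}


# Opening marker: "{", an optional single uppercase letter (an assumption), then "%".
_OPEN = re.compile(r"\{([A-Z])?%")


def scan_raw_block(text: str, start: int = 0) -> Optional[RawBlock]:
    """Scan one raw code block that begins exactly at `start`, or return None."""
    m = _OPEN.match(text, start)
    if m is None:
        return None
    # No tokenizing of the contents: just look for the first "%}".
    close = text.find("%}", m.end())
    if close == -1:
        raise ValueError("unterminated raw code block")
    return RawBlock(m.group(1), text[m.end():close], close + 2)


# A "%}" inside a string literal ends the block early, since nothing inside is tokenized:
# scan_raw_block('{% s = "100%}" %}').code == ' s = "100'
```

The last comment shows the price of the black-box approach: the scanner has no idea it is inside a string literal.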

So, here are a couple more comments about using these raw code snippets. An important point is that if we are outputting a given language, Rust, let's say, the generator simply ignores any code snippets that are identified as being in a language other than Rust. So, if we have a raw code snippet that is specified as being in Java, like:

          {J%
                  blah blah Java code
          %}

That's Java code and we just ignore it if we're outputting something other than Java. If no language is specified, i.e.

        {%
               code in some language
        %}

we're just going to assume that the language of the raw code block is the one we're outputting. I considered making the specification of the language mandatory, but on reflection, the vast majority of projects are not polyglot. You're outputting whatever you're outputting. And requiring people to always write {P% is just annoying. After all, even the plain {% ... %} is ugly enough, so why force people to write the letter for the language.

If we have a straight old Java code block with no % in it, and there is no adjacent block for our language, we're going to fall back on translating it, as before.

So, in the case of an existing grammar oriented towards generating Java code, if we try to generate some other language, well, some of the java code snippets can be auto-translated okay, and if it works, it works, fine. And if it doesn't work, you can stick in a raw code snippet that is specifically for the language you are generating.

Now, when it comes to so-called semantic lookahead, which is some predicate expressed in Java code (or actually, in whatever code now) the grammar is tweaked to allow raw code snippets and you can have more than one, i.e.

          SCAN {condition expressed in Java} {P% condition expressed in Python %} => Foobar

So, consider the above. If you're outputting Python, it uses the second condition, because it is specifically expressed in Python. So that's the one to use. But if you're outputting CSharp or Rust, say, and there is no code snippet specifically for those languages, it falls back to the old strategy of trying to auto-translate the one for Java. If there is a raw code snippet specifically marked as Java, as in:

         SCAN {J%  code in Java %}

then there is no autotranslation. It just ignores this if we are outputting a non-Java language.
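The selection rule can be summed up in a few lines of Python. This is only a sketch of the behavior described above, not the real implementation: `Snippet` and `choose_snippet` are hypothetical names, and the order of preference when several candidates match is my own assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Snippet:
    code: str
    raw: bool                   # True for {% ... %} style, False for a plain { ... } block
    lang: Optional[str] = None  # language letter of a raw block, None when omitted


def choose_snippet(snippets: List[Snippet], target: str) -> Tuple[Optional[Snippet], bool]:
    """Return (snippet, needs_translation) for the language being generated."""
    # 1. A raw block tagged for the target language, or untagged, is used verbatim.
    for s in snippets:
        if s.raw and s.lang in (target, None):
            return s, False
    # 2. Otherwise fall back on a plain { ... } Java block, translated if needed.
    for s in snippets:
        if not s.raw:
            return s, target != "J"
    # 3. Only raw blocks for other languages are left, and those are ignored.
    return None, False
```

So with a plain Java condition plus a {P% ... %} one, generating Python picks the second verbatim, generating Rust falls back to translating the first, and a lone {J% ... %} yields nothing for Rust.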

Another point that I shall make before closing this message is that the raw code blocks are just treated as black boxes on an initial parse. However, in the sanity check phase where various things are checked, and warnings or errors are possibly emitted, we do check whether the code in the raw blocks is valid or not. It would be criminal not to! After all, we have the ability to parse these code snippets in the various languages. But this is left to the end. This has the effect that you can have errors in multiple code snippets and they are listed. The parsing doesn't halt when you hit an error inside a raw code block. That is a difference with the Java code block written with {...}. A parsing error causes the whole process to halt. That is actually not so terrible, as a practical matter, but the raw code blocks don't work that way. We parse them towards the end of the build and list whatever errors we find. Here is where that happens. https://github.com/congo-cc/congo-parser-generator/blob/master/src/java/org/congocc/core/Grammar.java#L766-L775

Or, in other words, we don't pass through the erroneous code to get caught by the compiler and then you get an error message relative to the generated code. That would be a productivity sink. Any errors caught parsing the code snippets are reported relative to their location in the grammar file.
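Here is a rough Python illustration of that "check everything at the end and list all the errors" approach. It is not the code in Grammar.java: it uses Python's own `ast.parse` as a stand-in for the snippet parser, handles only Python expressions, and `RawSnippet`, `check_snippets` and the file name in the usage note are invented for the example.

```python
import ast
from typing import List, NamedTuple


class RawSnippet(NamedTuple):
    code: str
    line: int  # line in the grammar file where the snippet appears


def check_snippets(grammar_file: str, snippets: List[RawSnippet]) -> List[str]:
    """Parse every snippet and collect the errors instead of stopping at the first."""
    errors = []
    for s in snippets:
        try:
            # Stand-in validation: treat each snippet as a Python expression.
            ast.parse(s.code.strip(), mode="eval")
        except SyntaxError as e:
            # Report against the grammar file location, not the generated code.
            errors.append(f"{grammar_file}:{s.line}: {e.msg}")
    return errors
```

Called on snippets at, say, lines 10, 20 and 30 of a hypothetical Foo.ccc where the last two are malformed, this returns two messages, one per bad snippet, each prefixed with its own grammar-file line.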

Well, there are some other little details, loose ends, that I will outline separately. Note that, as of this writing, these raw code snippets can be used in semantic lookahead (what kind of pointy-head academic came up with such a bizarre term?) as well as wherever a regular Java code block can go. I still need to implement them inside of ENSURE/ASSERT and FAIL, but that should be easy enough. Then we need to attack INJECT.

revusky

Well, here I am talking to myself. Since writing that message yesterday, I did a bit of further work. Let's see...

One relatively minor point is that, in terms of the syntax of a raw code block being {%...%} or possibly {P%...%}, say, that was based on the notion that the string %}, to all intents and purposes, never occurs in any of the languages we are generating, so it can reliably mark the end of a code snippet. Well, that's not so true, it turns out, though in the case of Java specifically, it is in Javadoc comments that this sequence can occur. And, in any case, that %} can always occur in comments and literal strings. It may not be so common, but it is a definite possibility.

So, anyway, it's not so common, but it does actually occur, so my solution was to allow you to double-up on the delimiters in the case that you really need to have %} in your code snippet, so in that case you can write:

          {{%% ... %%}}

or:

          {{J%% ... %%}}

And the doubled-up %%}} is surely so rarely occurring anywhere that we are safe on that. (If necessary, we could allow people to triple-up the delimiters!)
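A minimal Python sketch of how a scanner might handle both delimiter forms follows. Again, this is an illustration under my own assumptions (single uppercase language letter, hypothetical function name `scan_raw_block`), not the real lexer.

```python
import re
from typing import Optional, Tuple

# Try the doubled opener "{{X%%" first, then the ordinary "{X%".
_OPEN = re.compile(r"\{\{([A-Z])?%%|\{([A-Z])?%")


def scan_raw_block(text: str, start: int = 0) -> Optional[Tuple[Optional[str], str]]:
    """Return (language letter or None, code) for a raw block starting at `start`."""
    m = _OPEN.match(text, start)
    if m is None:
        return None
    doubled = m.group(0).startswith("{{")
    # The closer has to mirror the opener that was actually used.
    closer = "%%}}" if doubled else "%}"
    close = text.find(closer, m.end())
    if close == -1:
        raise ValueError("unterminated raw code block")
    lang = m.group(1) or m.group(2)
    return lang, text[m.end():close]
```

With the doubled form, a stray %} inside a string literal no longer ends the block, because only %%}} is searched for.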

Now, in terms of specifying the output language in the raw code block, the letter, it did later dawn on me that it is not strictly necessary. It could all be handled by the preprocessor. For example, a code block with the 'P' for Python in it, {P%...%}, is basically a terser equivalent of:

          #if __python__
              {% ... %}
          #endif

In either case, the code snippet is ignored if we are not outputting Python basically. (NB. The output language is exposed to the preprocessor via a few preset symbols, like __python__, __java__, __csharp__, and now __rust__ as well.)

While our own example grammars should be polyglot, i.e. generate parsers in the various supported languages, I don't think most software development projects out there in the real world are polyglot. They typically will just care exclusively about outputting code in their language of choice. A Microsoft shop that is very centered around C# only cares about generating C#. And a Java shop only cares about outputting Java code, and so on. For the most part. There may be a few people (besides us!) interested in developing grammars that can generate source code in multiple languages, but I reckon it is rare.

I also did a first pass (it's not quite complete) of supporting raw code blocks with ASSERT/ENSURE and also FAIL.

You can now write:

    ENSURE {% someCondition() %}

The handling of ASSERT is somewhat unfinished. On the Java side, I recently (about a year ago maybe) enhanced the syntax so that you can write:

        ASSERT '{' javaExpression ["," location] [":" errorMessage] '}'

You can see an example of this here. Actually, I'll paste in the code:

  ASSERT {
    permissibleModifiers == null || hasMatch(permissibleModifiers,lastConsumedToken),
    lastConsumedToken : "Modifier " + lastConsumedToken + " not permitted here."
  }

You see, the Java expression it is asserting is true is permissibleModifiers == null || hasMatch(permissibleModifiers,lastConsumedToken) which means that, at least if permissibleModifiers (an EnumSet defining the set of TokenTypes permissible at this stage) is defined, then the modifier we just saw has to be one of those. This allows us to exclude nonsense like public private or final abstract, modifier combinations that just make no sense! But we express this condition and then there is a comma, and what follows the comma, which is lastConsumedToken, is a Node variable that is the location that is used to construct the resulting error message. And then what is after the : is the actual error message we construct.

Well, the above is not currently implemented for raw code snippets. At the moment, all you can write is:

      ASSERT {% condition expressed in whatever language %}

But the location and message after , and : respectively are not implemented yet. (They will be soon.)

In the case of FAIL, I realized that this is a tricky case. The way it works now is that you can write:

         Option1
         |
          Option2
         |
         FAIL "some message"

But actually, where you have "some message" can be any Java expression. However, you can also have:

        FAIL {some java code}

This is an interesting point, because there is a quite important difference between:

         (Option1 | Option2 | FAIL {throw some exception})

and:

        (Option1 | Option2 | {throw some exception})

The difference relates to when you are in a lookahead. (If you are actually parsing, it's all the same.) If you are in a lookahead, and you scan ahead to {throw some exception} the lookahead is taken to have succeeded. That is because hitting any Java code block (which is really just a black box from the point of view of the parser generator system) is taken to be a success. But if that Java code block has FAIL preceding it, it means that if we scan ahead to this point, the lookahead did not succeed!
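A toy model in Python may make the difference clearer. It is a deliberately crude simplification, not how the generated lookahead routines actually look: an alternative is just a list of token names plus the markers "CODE" and "FAIL", and I am assuming a code block is simply skipped over when scanning.

```python
from typing import Sequence


def scan_alternative(elements: Sequence[str], tokens: Sequence[str]) -> bool:
    """Lookahead over one alternative; elements are token names, "CODE" or "FAIL"."""
    pos = 0
    for el in elements:
        if el == "FAIL":
            return False  # reaching FAIL means the lookahead did not succeed
        if el == "CODE":
            continue      # an opaque code block is skipped, which never fails the scan
        if pos >= len(tokens) or tokens[pos] != el:
            return False
        pos += 1
    return True


def scan_choice(alternatives: Sequence[Sequence[str]], tokens: Sequence[str]) -> bool:
    """A choice succeeds in lookahead if any one of its alternatives does."""
    return any(scan_alternative(alt, tokens) for alt in alternatives)


with_fail = [["Option1"], ["Option2"], ["FAIL", "CODE"]]
without_fail = [["Option1"], ["Option2"], ["CODE"]]
```

For an input that matches neither option, `scan_choice(without_fail, ["Other"])` is True while `scan_choice(with_fail, ["Other"])` is False, which is exactly the distinction described above.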

So the FAIL statement is an absolutely necessary element, if you think about it, to express certain things...

But anyway, as things stand, we have a bit of a problem with FAIL in association with raw code blocks because we have two sorts of FAIL statements, one which specifies an error message, and the other that specifies a code block. (If one wants to pick nits, there is also a third type (just FAIL alone) that specifies neither of the above!)

But once we allow:

        FAIL {% some code %}

how do we know whether the code inside the {% ... %} raw code block is an expression (to be used to construct an error message) or is actually a code block with one or more statements to run? (A weird twist on this is that the above distinction between expressions and statements barely exists in Rust. Practically everything that we consider a statement in Java or other languages is also an expression in Rust. So...)

Well, anyway, the solution I have found (it's as good as I can come up with) is that if the code block is meant to be code to be run, then we write:

         FAIL => {% code block %}

and if it is just an expression to construct an error message, then there is no arrow.

Of course there is no ambiguity in the case of the existing Java-centric way it is defined. When you write:

         FAIL { java code block }

or:

          FAIL "you dummy!"

there is no ambiguity. But... you can (now) optionally write:

       FAIL => {java code block}

so that it is consistent with how the raw code block works.
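Summing up the FAIL variants, the disambiguation rule could be sketched like this in Python. The function `classify_fail` is purely illustrative (the real thing is done by the grammar, not by string matching), and the single-uppercase-letter language marker is an assumption.

```python
import re

# Opening of a raw code block: "{" or "{{", an optional language letter, then "%".
_RAW_OPEN = re.compile(r"\{\{?[A-Z]?%")


def classify_fail(stmt: str) -> str:
    """Classify a FAIL statement as "bare", "message" or "code"."""
    rest = stmt.strip()
    if not rest.startswith("FAIL"):
        raise ValueError("not a FAIL statement")
    rest = rest[len("FAIL"):].lstrip()
    if not rest:
        return "bare"     # just FAIL alone
    if rest.startswith("=>"):
        return "code"     # the arrow always introduces code to be run
    if _RAW_OPEN.match(rest):
        return "message"  # raw block without arrow: an expression for the message
    if rest.startswith("{"):
        return "code"     # old-style { java code block }
    return "message"      # e.g. a string literal or other expression
```

Under this rule `FAIL {% some code %}` is a message expression, while `FAIL => {% code block %}`, `FAIL { java code block }` and `FAIL => {java code block}` all mean code to run.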

So, anyway, there may be some glitches in all this but it's basically all implemented. (Well, except for specifying the location and error message in an ASSERT, which is still unimplemented, but will be quite soon.)

So, anyway, that's the state of the world at the moment. All comments and ideas are welcome. Thanks for reading to this point!