We're not completely there yet, but over the last couple of weeks I have done what I think is the bulk of the work toward getting there -- by "there", I mean being able to treat non-Java languages as first-class citizens in CongoCC.

It turned out that the main thing necessary was to introduce the notion of "raw code blocks", which are blocks of code -- in principle, in any language -- that can be embedded in the grammar. Well, how to explain... let's see...

Consider a code action in the grammar. In Java:

          Foobar {foobarCount++; tweakSomethingOrOther();}

We parse a Foobar and then we run the code in the action. So, really, it's just going to generate something like:

         (code to parse a Foobar)
         foobarCount++;
         tweakSomethingOrOther();

Well, the above is extra simplified, but it's something more or less like that. In principle, if the code in the block is valid, we can just drop it in, no? But that is not what CongoCC does here, nor did legacy JavaCC long before it. It actually parses the code! Now, why does it do that? Why not just drop the code into the appropriate spot? After all, the grammar writer knows what he's doing, right?

Well, yeah, if one never made a mistake when writing the code block, it would be all the same. But of course... there is the issue of error reporting. If we leave out a semicolon, or we write foobarCount+; which is not a valid statement... Or we fail to close the quotes in a string literal... If we just insert the "dirty" code in the appropriate spot, it does eventually get caught by the Java compiler, but that is at a later point in the build, and, perhaps more importantly, the error locations the Java compiler reports are relative to the generated code, not to where the error really occurs in terms of your ongoing work, which is in the grammar file. And there are some other issues, off the top of my head. A parser generator tool that just passes "dirty" code through will frequently generate a Java source file that cannot be parsed, because it is invalid. So the beautifying step that CongoCC applies to the Java code does not quite work on it, and you would have to open up the unbeautified code in your text editor to see where the problem is... It leads to a much less appealing situation.

And, as I said, we want the non-Java languages to be first-class citizens in the system, so we want to have it working, broadly speaking, like it does for Java. It hits the error and gives you a message with the error location based on where the error occurred in the grammar. If we don't have this working the same way for the non-Java languages, this is still quite half-baked, no?

Now, it so happens that we have the ability to parse Python and CSharp snippets, so what we do is allow the raw code blocks in the grammar file, and at a later stage the tool goes through and makes sure it can parse them. Actually, one interesting aspect of this is that we can delay that to a later stage, so if there are multiple such blocks with errors in them, it can go through and try to parse them all, and report all the errors together at the end. (Little detail: if there is more than one syntax error in a given code action, it will only report the first one. But if there are multiple code blocks that contain errors, it reports the first error in each block.) All of that is of marginal value perhaps, since the way it works (for Java, I mean) is pretty okay productivity-wise. I mean, if you left out a semicolon in a code action, it tells you immediately, with the location. You fix that and run the tool again, and there is an error somewhere else, and you go and fix it and... I mean, just having it stop and report the first error it hits is not particularly bad. At least, I find that to be the case. But we have the option of accumulating the errors and reporting them all at one go at the end. See here. (If you're interested...)

But we could also have the older behavior of just stopping on the first error and it could even be configurable quite easily. So we can do it one way or the other anyway...
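To make the deferred-reporting idea concrete, here is a minimal sketch, not the actual CongoCC code (which is Java; the names here are mine): raw blocks are slurped in first, then validated in a later pass, recording only the first error in each block and reporting them all together.

```python
# Toy sketch of deferred validation: raw blocks are collected first,
# then parsed in a later pass, reporting the first error in each block.
import ast

def check_blocks(blocks):
    """blocks: list of (grammar_line, code_text) pairs."""
    errors = []
    for line, code in blocks:
        try:
            ast.parse(code)  # stand-in for parsing the target-language snippet
        except SyntaxError as e:
            # only the first syntax error in each block gets recorded
            errors.append((line, e.msg))
    return errors

blocks = [(10, "x = 1"), (20, "x +"), (30, "if True\n    pass")]
for line, msg in check_blocks(blocks):
    print(f"grammar line {line}: {msg}")
```

The point is simply that parsing is decoupled from slurping, so the tool keeps going past a bad block and still knows the grammar-file location to report.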

So, let's see... the way you specify a raw code block in a grammar is by enclosing it with {% and %}. (I think that's okay, but it could be changed if there is a better idea.) So, anyway, the above code could alternatively be written as:

    Foobar {%foobarCount++; tweakSomethingOrOther();%}

And it's effectively the same, except that if there is a syntax error to report, the tool can keep parsing past this point and report the error (along with other similar errors) at the end. In terms of the code that the tool generates, it is the same! Because, basically, it is just inserting the code inside the block at the appropriate point. The machinery behind it is different, though. As you can see in more gory detail here, it just slurps in the code inside the block without making any attempt to validate it, then goes back and does so later.

And this, of course, leads to the fact that we can have code in other languages in the {%...%} code block and, based on what language we are outputting, we can do exactly what we do with Java code. We parse it in a final step and stick it in there. (Again, we could stick it in there without parsing it, but we want the non-Java languages to be on a par, and besides, we can perfectly well parse the code, so we should do it!)

BUT... there is a bit of an elephant in the room here, because this works for C# certainly, but just inserting the Python code is quite problematic because of how Python's syntax works. If you don't stick the code in with the proper indentation, the resulting source file will not be valid. And that is actually a rather nasty problem, which I believe is now solved.

Now, for one thing, we can't even parse a block of Python code standalone by default: unless the code starts off at the far left, it is invalid. That's just how Python syntax works. So, if we're generating Python and we have:

     Foobar
     {%
          if someCondition() :
             doThis()
          else : 
             doThat()
     %}

To parse this, we need to send the parser this input:

if someCondition() : 
    doThis()
else : 
    doThat()

So we need to remove the superfluous indentation, which is done here. Rather serendipitously, the JDK API for java.lang.String now has an indent method to use, see here. No big deal, but it's nice just to have this and not have to write these fiddly things oneself.
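In Python terms (just for illustration; the actual implementation is Java, using that String.indent method with a negative argument), stripping the common leading whitespace looks like this:

```python
# Strip the superfluous common indentation so the snippet can be parsed
# standalone; textwrap.dedent plays the role of String.indent(-n) here.
import textwrap

raw = """\
           if someCondition() :
              doThis()
           else :
              doThat()
"""
print(textwrap.dedent(raw))
```

Note that only the *common* prefix is removed, so the relative indentation inside the block survives, which is all the parser needs.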

So then we can parse the Python block, but we still have a problem when we want to insert it into the file. Well, the solution is that we generate a kind of hacked Python as an intermediate format in which we don't have to keep track of the indents. But we explicitly put in the dedents (or dedent markers, more precisely), so we then munge the above (after moving it to the far left to parse it) into:

# explicitdedent:on
if someCondition() :
    doThis()
<-
else :
    doThat()
<-
#explicitdedent:restore

That is what gets inserted into the initial Python (or hacked Python) source file. And then, in the final beautifying step, it reads the above code in, beautifies it, removes the <- markers, and effectively makes sure that the inserted code (in the final source file) is indented consistently with the point where it was inserted.
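As a toy illustration of the marker idea (my own simplified reconstruction, not the real CongoCC code), here is how one might turn an already-dedented snippet into the explicit-dedent form:

```python
# Replace each dedent in a Python snippet with an explicit "<-" marker
# line, so later passes need not track column positions.
def to_explicit_dedents(code):
    out, stack = [], [0]
    for line in code.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent > stack[-1]:
            stack.append(indent)
        while indent < stack[-1]:   # one marker per dedent level
            stack.pop()
            out.append("<-")
        out.append(line)
    while len(stack) > 1:           # close any still-open blocks at the end
        stack.pop()
        out.append("<-")
    return "\n".join(out)

snippet = "if someCondition() :\n    doThis()\nelse :\n    doThat()"
print(to_explicit_dedents(snippet))
```

Run on the example above, this emits the if/else lines with a `<-` after each indented suite, which is essentially the intermediate form shown.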

The above is quite a bit of machinery actually, but curiously, the grammar writer (for a grammar that generates Python) should be pretty much oblivious to it. So, he can write:

    Foobar
    {%
         ... block 1...
    %}

and later have:

               Foobaz
               {%
                    ... block 2....
               %}

You see, blocks 1 and 2 above are presumably indented in the grammar file in a way that makes sense for that file. But the machinery is in place so that when the block is inserted into the generated Python source file, it is at the right indent point and the right indent/dedent is inserted.
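A sketch of what that final pretty-print pass might do (again, my guess at the mechanics, not the actual implementation): ignore whatever indentation the marker-form lines happen to carry, and re-emit them with consistent indents starting at the insertion column, going one level deeper after a line ending in a colon and one level back at each `<-`:

```python
# Re-indent explicit-dedent marker code at a given insertion column,
# dropping the markers and the explicitdedent pragma comments.
def beautify(marker_code, insert_column=0, tab="    "):
    out, level = [], 0
    for line in marker_code.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("# explicitdedent") \
                or stripped.startswith("#explicitdedent"):
            continue                      # pragma lines are dropped
        if stripped == "<-":
            level -= 1                    # explicit dedent marker
            continue
        out.append(" " * insert_column + tab * level + stripped)
        if stripped.endswith(":"):
            level += 1                    # a suite opens after a colon
    return "\n".join(out)

marker = ("# explicitdedent:on\nif someCondition() :\n    doThis()\n"
          "<-\nelse :\n    doThat()\n<-\n#explicitdedent:restore")
print(beautify(marker, insert_column=8))
```

(The colon heuristic is an oversimplification of what a real Python pretty-printer does, but it shows why the markers make the insertion point irrelevant to the grammar writer.)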

None of this is necessary for C#, which can work pretty much analogously to Java. We insert the code (after we have checked that we can parse it) without any concern for formatting, and then we beautify it in a final pass. Dealing with Python, OTOH, really did require a certain amount of wizardry. (Sorry for the immodesty.)

Well, there are more details I could go into about all this, but suffice it to say that it seems that all of the places where you could previously only put an actual Java code block, you can now use the {%...%} syntax to put in code in whatever language is being output.

There are some other dangling issues that I need to get at. For now, the older disposition remains: if you write a Java code action (I mean just using {...}), the tool makes an effort to translate it to the actual non-Java output language. It does not do that for {%...%}. For now, if you wanted to write a grammar using raw code actions that works for the various output languages, you would need to write:

            #if __java__
                {%some Java code%}
            #elif __python__
                {%corresponding Python code%}
            #elif __csharp__
                {%corresponding csharp code%}
            #endif

That is rather verbose and I am thinking I can also allow:

              J{%some Java code%}
              P{%the Python code%}
              C{%the csharp code%}

And the (optional) starting letter would say that we ignore this block unless we are outputting the given language. Something like that. But that is unimplemented at the moment.
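If that were implemented, the selection logic might look something like this sketch (entirely hypothetical, since the feature is unimplemented; the regex and function names are mine):

```python
# Keep a raw code block only when its optional language-letter prefix
# matches the output language (or when it has no prefix at all).
import re

BLOCK = re.compile(r"([JPC])?\{%(.*?)%\}", re.DOTALL)

def select_blocks(grammar_text, lang):
    """lang is one of "J", "P", "C"."""
    kept = []
    for m in BLOCK.finditer(grammar_text):
        prefix, body = m.group(1), m.group(2)
        if prefix is None or prefix == lang:
            kept.append(body.strip())
    return kept

src = 'J{%int x;%} P{%x = 0%} C{%int x;%} {%// always%}'
print(select_blocks(src, "P"))
```

Prefix-less blocks apply to every output language, which matches the "optional starting letter" idea above.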

Also unimplemented (though that is the next step) is code injections using the non-Java languages. So I need to write versions of CodeInjector that inject the Python/CSharp into the tree where needed. (And then in the final pretty-print pass, the indentation gets sorted out.)

Okay, that is where things are at right now. Once INJECT is working for the non-Java languages (which may be the case by the end of the month or so) we really will have reached a milestone finally with CongoCC, don't you think?

Oh, by the way, in case anybody is wondering why I'm doing this, I think the answer is the same one George Mallory gave for wanting to climb Everest. Because it was there!


    revusky

    You know, actually, after writing the above note, I started playing with different notations, and the one I like best (and have settled on in the examples above) is {% ... %}.

    Here is another notational thing. I was thinking about thisProduction, which seems rather verbose. This, by the way, originated in legacy JavaCC (actually the later JJTree add-on, 1997) as jjtThis. I later changed it to CURRENT_NODE. That is maybe a bit ugly admittedly.

    I was thinking of changing all that to the much shorter esto, which, as you likely know is simply "this" in Spanish. But then I was wondering whether ESTO is better because it is distinguished as being really a special sort of variable.

    One thing that I realized recently concerns the (rather clumsy) aliasing that I have in place, where PARSER_CLASS gets munged into the actual parser class name. And there are LEXER_CLASS and NODE_PACKAGE and so on. But all of that doesn't work in a raw code block. From the grammar's point of view, what is in the raw code block is sort of formless schmoo. It does not even lex it.

    But, you know, I have been toying with the idea of introducing a limited amount of aliasing in the preprocessor. Maybe something like a simple $foo. So you could have:

              #define foo="schmoo"

    And you could write $foo and, well... you know... But that would work on the pre-lexical level, when the file is slurped in. So, aside from the aliases we define ourselves, we could have a few preset ones, so that we could have $PARSER_CLASS_NAME and the others. But we'd have the $ in front. By the way, when I say limited aliasing, I mean fairly limited: the replacement text can't span multiple lines, and, of course, we're not going to have embedding, like #define foo="schmoo$bar" where the $bar inside the string gets replaced. Or feature creep like that.

    Though, actually, with FreeMarker (now called Congo Templates) merged in, I was even toying with the idea of exposing full (or nearly full) FreeMarker functionality, but then fairly quickly dismissed the idea. I think that adding this limited string substitution to the preprocessor could be about right.

    By the way, I was also toying with the idea of letting people write just $foo in FreeMarker instead of ${foo}. And I implemented it, but then commented it out. It seems to mostly be confusing. If you write $foo.bar, what does that mean? You could scan it visually as ${foo}.bar, which could make sense when generating source code. But it could also mean ${foo.bar}. So I finally decided to leave it be. But my idea for the preprocessor is that the only aliasing we have is a single identifier, so we just have $foo. Any opinions about having the aliasing in the preprocessor? Would somebody find that useful?
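    For what it's worth, here is a sketch of how that limited substitution might behave. The semantics are my assumptions: single-identifier names, one-line replacement text, a single non-recursive pass, and unknown names left alone.

```python
# Limited $foo aliasing: replace $identifier with its defined value in
# a single pass; unknown names pass through untouched.
import re

_ALIAS = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def expand(text, defines):
    # re.sub does one left-to-right pass, so replacements never recurse
    return _ALIAS.sub(lambda m: defines.get(m.group(1), m.group(0)), text)

defines = {"foo": "schmoo", "PARSER_CLASS_NAME": "MyParser"}
print(expand("new $PARSER_CLASS_NAME(); // $foo $bar", defines))
# → new MyParser(); // schmoo $bar
```

    Because the substitution is a single pass over the raw text, it would work even inside raw code blocks, which the grammar otherwise never lexes.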

    By the way, I think we'll start calling this ongoing development CongoCC 3.x. We should liberally up the version number. After all, legacy JavaCC is on 7.14, I think!

    I was wondering if you were considering INJECTed code to be eligible for raw treatment. Now, I see that you are. This all seems to be a huge step in climbing K2, namely the consistent and elegant handling of multiple languages. I guess the ultimate litmus test would be a CongoCC that itself runs in all the target languages.

    In COBOL, I used sub-parsers to parse embedded non-COBOL or weirdly complex COBOL by just instantiating another parser, and firing it up with the proper root when a block needs to be parsed in situ. I can then indicate an error exactly where it is within the block, and then move on at the COBOL level. We do the same thing in the dynamic injection method in CongoCC to parse automatically injected properties used by the JTB tree generation code and the "@" assignments. With that approach, we could parse all 3 languages when alternates are provided, and indicate syntax errors even in the language(s) that is not selected for the current parse.

    Have you considered putting the language selector inside the brace like this {J% ... %}? I think it would be less likely to have grammar side-effects, and to me it seems more "contained" and descriptive.

    Anyway, I've been watching, and even dodging, the recent commits with interest. Let me know if I can help, or there is anything you run across in your changes that I need to fix, particularly in the cardinality template implementations. I'm not at all sure I properly used the best templating techniques there.

    Also, if someone else is reading this thread, the thisProduction reference is not exactly equivalent to CURRENT_NODE as I recall. The former always refers to the BNFProduction Node that is being built, but CURRENT_NODE refers to the Node last constructed. Usually it is the production, but in the case of tree node attributes or (in the case of JTB generation) synthetic nodes, it is not. THIS_PRODUCTION may also be used to refer to thisProduction in cases that need to be language agnostic (i.e., will be translated from Java to Python or CSharp).

      In my playing around with a PEG (Ford) grammar description in the sandbox, I recently added action blocks. I think I need to generate the new {% ... %} blocks for the extended PEG { ... } now!

      adMartem Anyway, I've been watching, and even dodging, the recent commits with interest. Let me know if I can help,

      Actually, since you mention this, maybe you could look into (and resolve) this matter of the CURRENT_NODE a.k.a. jjtThis a.k.a. thisProduction.... Yes, it must be that the node corresponding to the production is not necessarily the same as the node being built, because that could be the result of an inline tree annotation, no?

      But, somehow, due to some wooly minded thinking on my part, I got it into my head that it was all the same thing. So I went through and changed everything willy-nilly to thisProduction, but I guess that didn't break anything, because I just never really used CURRENT_NODE to refer to anything other than that. In fact, in my own work, I only very rarely use inline tree annotations. I almost always use the ones that you put on a BNFProduction. So I guess I was never using CURRENT_NODE for an inline tree annotation.

      So, anyway, if you could just go through this and make sure it's working again (it may be broken as of this exact moment), I would be very happy, because I would rather just keep working on what I'm working on.

      As I said, I would prefer some terser alternatives to things like CURRENT_NODE and thisProduction. If we used the (much) shorter ESTO (which is, of course, just 'this' in Spanish) instead of CURRENT_NODE, what could replace the longer-winded thisProduction?

      Well, I'd like these things to be shorter, but if you do it, just choose whatever term you want, I guess.

      Also, if someone else is reading this thread, the thisProduction reference is not exactly equivalent to CURRENT_NODE as I recall. The former always refers to the BNFProduction Node that is being built, but CURRENT_NODE refers to the Node last constructed. Usually it is the production, but in the case of tree node attributes or (in the case of JTB generation) synthetic nodes, it is not.

      Yes, you must be right, but I was having a mental lapse and thinking it was all the same thing!

      I'll check that. Actually, after writing here, I glanced at the code and noticed that thisProduction actually was, as you said, the same as CURRENT_NODE. I thought I must have dreamed that it was different, since I couldn't find the code I remembered that peeked at the stack for CURRENT_NODE. I'll check all the known occurrences of it and make sure nothing really needed the stacked node. Funnily, when I changed the generated name for the BNFProduction to be thisProduction, I made the same change you did, but backed it out when my testing of user-specified TNAs failed. I think it would be fine to just require the use of peek if anyone wants to access the old CURRENT_NODE in that case.

      We could just use the Latin "HOC" (this). In any case I didn't find any uses currently (no pun intended) that seem to refer to the last closed node. They all are referring to the production node. We could use "HIC" (here) to refer to the peek() node if we wanted to keep that capability for when tree node annotation actions need it.

      I personally think HOC and HIC are cute, but I fear the younger generation might think them a little obscure / non-obvious.

        I thought they were cute too, maybe a little too cute, but I like them.

        vsajip

        Well, far more young people (certainly in America, but also in Europe, I daresay) have studied Spanish than Latin. So I still tend to think that ESTO is preferable. (Maybe ESTO and ESO, THIS and THAT, but I'm not sure which would be which.)

        Then I was thinking that, as long as they are capitalized, THIS and THAT are available, no? It's this in lower case that is a reserved word in Java. So... maybe that is the obvious solution right in front of our noses... Could that be?

        6 days later

        Now there's a thought! It is certainly highly unlikely to be something a Java programmer would resent being deprived of. And probably better than TIT and TAT.