We're not completely there yet, but over the last couple of weeks I have done what I think is the bulk of the work of getting there. By "there", I mean being able to treat non-Java languages as first-class citizens in CongoCC.
It turned out that the main thing necessary was to introduce the notion of "raw code blocks", which are blocks of code -- in principle, in any language -- that can be in the grammar. Well, how to explain... let's see...
Consider a code action in the grammar. In Java:
Foobar {foobarCount++; tweakSomethingOrOther();}
We parse a Foobar and then we run the code in the action. So, really, it's just going to generate something like:
(code to parse a Foobar)
foobarCount++;
tweakSomethingOrOther();
Well, the above is extra simplified, but it's something more or less like that. In principle, if the code in the block is valid, we can just drop it in, no? But that is not what CongoCC does here, nor what legacy JavaCC did long before it. It actually parses the code! Now, why does it do that? Why not just drop the code into the appropriate spot? After all, the grammar writer knows what he's doing, right?
Well, yeah, if one never made a mistake when writing the code block, that would be all the same. But of course... there is the issue of error reporting. Suppose we leave out a semicolon, or we write foobarCount+; which is not a valid statement... Or we fail to close the quotes in a string literal.... If we just insert the "dirty" code in the appropriate spot, it does eventually get caught by the Java compiler, but only at a later point in the build. Perhaps more importantly, the error locations that the Java compiler reports are relative to the generated code, not to where the error really occurs in terms of your ongoing work, which is in the grammar file. And there are some other issues, off the top of my head. The parser generator tool, if it just passes "dirty" code through, will frequently be generating a Java source file that cannot be parsed because it's invalid. So the approach that CongoCC takes, of beautifying the generated Java code, does not quite work, and you would have to open up the unbeautified code in your text editor to see where the problem is... It leads to a much less appealing situation.
And, as I said, we want the non-Java languages to be first-class citizens in the system, so we want to have it working, broadly speaking, like it does for Java. It hits the error and gives you a message with the error location based on where the error occurred in the grammar. If we don't have this working the same way for the non-Java languages, this is still quite half-baked, no?
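To make the point about error locations concrete, here is a minimal sketch in Python. CongoCC itself is written in Java and this is not its code: the function name and its parameters are hypothetical, and Python's own ast.parse stands in for the snippet parser. The only assumption is that the tool remembers the grammar-file line on which each raw block starts, so that a snippet-relative error line can be translated back.

```python
import ast


def first_syntax_error(snippet, block_start_line):
    """Parse a raw code snippet and report its first syntax error, if any.

    block_start_line is the (1-based) grammar-file line on which the
    snippet's first line sits. Both names are hypothetical.
    Returns (grammar_line, message), or None if the snippet parses.
    """
    try:
        ast.parse(snippet)
    except SyntaxError as e:
        # e.lineno is relative to the snippet; shift it into grammar-file terms.
        snippet_line = e.lineno or 1
        return (block_start_line + snippet_line - 1, e.msg)
    return None


# A block whose text starts on line 40 of the grammar, with an error on its 2nd line.
report = first_syntax_error("x = 1\ny = = 2\n", 40)
```

The error is reported against line 41 of the grammar file rather than against whatever line the snippet ends up on in the generated source.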
Now, it so happens that we have the ability to parse Python and CSharp snippets, so what we do is we allow the raw code injection in the grammar file, but at a later stage, it goes through and makes sure it can parse them. Actually, one interesting aspect of this is that we can delay that to a later stage, so if there are multiple such blocks with errors in them, it can go through and try to parse them, and report all the errors together at the end. (Little detail: if there is more than one syntactic error in a given code action, it will only report the first one. But if there are multiple code blocks that contain errors, it reports the first error in each block.) All of that is actually of marginal value perhaps, since the way it works (for Java, I mean) is pretty okay productivity wise. I mean, if you left out a semicolon in a code action, it tells you immediately, with the location. You fix that and you run the tool again and there is an error somewhere else, and you go and fix it and... I mean, just having it stop and report the first error it hits is not particularly bad. At least, I find that to be the case. But we have the option of accumulating the errors and reporting them at one go at the end. See here. (If you're interested...)
But we could also have the older behavior of just stopping on the first error and it could even be configurable quite easily. So we can do it one way or the other anyway...
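The accumulate-then-report behavior can be sketched along the same lines. Again this is an illustration in Python under stated assumptions, not the CongoCC implementation: collect_block_errors and the (name, source) pairs are hypothetical, and ast.parse plays the role of the snippet parser. Because the parser gives up at the first syntax error in a snippet, each broken block contributes exactly one entry.

```python
import ast


def collect_block_errors(blocks):
    """Validate every raw block and gather the errors instead of stopping early.

    blocks is a list of (name, source) pairs (a hypothetical representation).
    Returns a list of (name, line_in_block, message), one per broken block.
    """
    errors = []
    for name, source in blocks:
        try:
            ast.parse(source)
        except SyntaxError as e:
            # Parsing stops at the first error, so only that one is recorded.
            errors.append((name, e.lineno, e.msg))
    return errors


blocks = [
    ("Foobar", "x = 1\n"),               # fine
    ("Foobaz", "y = = 2\nz = = 3\n"),    # two errors, only the first is reported
    ("Foobat", "def f(:\n    pass\n"),   # one error
]
errors = collect_block_errors(blocks)
for name, line, message in errors:
    print(f"{name}: line {line} of block: {message}")
```

Switching to the older stop-on-first-error behavior would amount to returning from inside the loop as soon as one error is found.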
So, let's see... the way you specify a raw code block in a grammar is by enclosing it with {% and %}. (I think that's okay, but it could be changed if there is a better idea.) So, anyway, the above code would be alternatively written as:
Foobar {%foobarCount++; tweakSomethingOrOther();%}
And it's effectively the same, except that if there is a syntactical error to report, it does it at the end, so it can keep parsing past this point and reports the error (along with other similar errors) at the end. In terms of the code that the tool generates, it is the same! Because, basically, it is just inserting the code inside the block into the appropriate point. The machinery behind it is different though. As you can see in more gory detail here, it is just slurping in the code inside the block without making any attempt to validate it, and then going back and doing so later.
And this, of course, leads to the fact that we can have code in other languages in the {%...%} code block and, based on what language we are outputting, we can do exactly what we do with Java code. We parse it in a final step and stick it in there. (Again, we could stick it in there without parsing it, but we want the non-Java languages to be on a par, and besides, we can perfectly well parse the code, so we should do it!)
BUT... there is a bit of an elephant in the room here, because this works for C# certainly, but actually, just inserting the Python code is quite problematic because of how Python's syntax works. If you don't stick the code in with the proper indentation, the resulting source file will not be valid. And that is actually a rather nasty problem, that I believe is now solved.
Now, for one thing, we can't even parse a block of Python code standalone by default: unless the code starts flush against the left margin, it is invalid. That's just how Python syntax works. So, if we're generating Python and we have:
Foobar
    {%
        if someCondition() :
            doThis()
        else :
            doThat()
    %}
To parse this, we need to send the parser this input:
if someCondition() :
    doThis()
else :
    doThat()
So we need to remove the superfluous indentation, which is done here. Rather serendipitously, the JDK API for java.lang.String now has an indent method to use, see here. No big deal, but it's nice just to have this and not have to write these fiddly things oneself.
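As an illustration of why the block has to be moved to the left margin before parsing, here is a small Python sketch. It is not the CongoCC code (which is Java and uses the String API mentioned above); it uses Python's standard textwrap.dedent as the analogous ready-made helper, and the parses function is a hypothetical convenience wrapper around ast.parse.

```python
import ast
import textwrap

# The block as it might sit in the grammar file, indented to suit that file.
block = """
        if someCondition() :
            doThis()
        else :
            doThat()
"""


def parses(source):
    """Return True if source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Strip the whitespace prefix that every line has in common.
flush_left = textwrap.dedent(block)

print(parses(block))       # the indented original is rejected
print(parses(flush_left))  # the dedented version is accepted
```

Only the common leading whitespace is removed, so the relative indentation inside the block (the part that carries meaning in Python) is preserved.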
So, then we can parse the Python block, but then we have a problem when we want to insert it into the file. Well, the solution is that we generate a kind of hacked Python as an intermediate format where we don't have to keep track of the indents. But we explicitly put in the dedents (or dedent markers, more precisely), so we then munge the above (after moving it far left to parse it) into:
# explicitdedent:on
if someCondition() :
doThis()
<-
else :
doThat()
<-
#explicitdedent:restore
That is what gets inserted into the initial Python (or hacked Python) source file. And then in the final beautifying step, it reads the above code in, beautifies it, removes the <- markers, and effectively makes sure that the inserted code (in the final source file) is indented consistently with where it was inserted.
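The round trip through explicit dedent markers can be sketched as follows. This is a deliberately simplified Python illustration, not the actual CongoCC machinery: both function names are hypothetical, the sketch assumes space-only indentation, no multi-line strings or continuation lines, and it treats any line ending in a colon as opening a block. The #explicitdedent comment lines that bracket the block are left out.

```python
def to_explicit_dedent(source):
    """Flatten indented Python into unindented lines plus '<-' dedent markers."""
    out = []
    stack = [0]  # indentation widths of the blocks currently open
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no indentation information
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)       # a new block was opened
        while width < stack[-1]:
            stack.pop()
            out.append("<-")          # one marker per block closed
        out.append(line.strip())
    out.extend("<-" for _ in stack[1:])  # close whatever is still open
    return out


def from_explicit_dedent(lines, base_indent=0, step=4):
    """Re-indent marker-form lines, starting at the given base indentation."""
    out = []
    depth = 0
    for line in lines:
        if line == "<-":
            depth -= 1
            continue
        out.append(" " * (base_indent + depth * step) + line)
        if line.endswith(":"):  # simplification: a trailing colon opens a block
            depth += 1
    return "\n".join(out)


source = "if someCondition() :\n    doThis()\nelse :\n    doThat()"
marked = to_explicit_dedent(source)
reindented = from_explicit_dedent(marked, base_indent=8)
print("\n".join(marked))
print(reindented)
```

The useful property is that the marker form carries no absolute indentation at all, so the same flattened block can be dropped in at any nesting depth and re-indented in a single final pass.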
The above is quite a bit of machinery actually, but curiously, the grammar writer (for a grammar that generates Python) should be pretty much oblivious to it. So, he can write:
Foobar
{%
... block 1...
%}
and later have:
Foobaz
{%
... block 2....
%}
You see, blocks 1 and 2 above are presumably indented in the grammar file in a way that makes sense for that file. But the machinery is in place so that when the block is inserted into the generated Python source file, it is at the right indent point and the right indent/dedent is inserted.
None of this is necessary for C#, which can work pretty much analogously to Java. We insert the code (after we have checked that we can parse it) and we insert it without any concern for formatting and then we beautify it in a final pass. Dealing with the Python, OTOH, really did require a certain amount of wizardry. (Sorry for the immodesty.)
Well, there are more details I could go into about all this, but suffice it to say that it seems that all of the places where you could previously only put an actual Java code block, you can now use the {%...%} syntax to put in code in whatever language is being output.
There are some other dangling issues that I need to get at. For now, the older disposition still holds: if you write a Java code action (I mean just using {...}), it will make an effort to translate it to the actual non-Java output language. It does not do that for {%....%}. For now, if you wanted to write a grammar using the raw code actions that can work for the various output languages, you would need to write:
#if __java__
{%some Java code%}
#elif __python__
{%corresponding Python code%}
#elif __csharp__
{%corresponding csharp code%}
#endif
That is rather verbose and I am thinking I can also allow:
J{%some Java code%}
P{%the Python code%}
C{%the csharp code%}
And the (optional) starting letter would say that we ignore this block unless we are outputting the given language. Something like that. But that is unimplemented at the moment.
Also unimplemented (though that is the next step) is code injections using the non-Java languages. So I need to write versions of CodeInjector that inject the Python/CSharp into the tree where needed. (And then in the final pretty-print pass, the indentation gets sorted out.)
Okay, that is where things are at right now. Once INJECT is working for the non-Java languages (which may be the case by the end of the month or so) we really will have reached a milestone finally with CongoCC, don't you think?
Oh, by the way, in case anybody is wondering why I'm doing this, I think the answer is the same one George Mallory gave when asked why he wanted to climb Everest. Because it's there!