Hacking Python Syntax (I think I'm getting it right now.)

revusky

As I described earlier, I finally decided that we absolutely need a kind of "hacked" Python syntax to use as an intermediate format for the templates to generate Python code. Then on the next pass, we build a Python AST from the "hacked" Python, we do whatever dead code elimination and stuff, and then spit out the regular Python code.

My first pass on a "hacked" Python syntax to use for code generation was the result of some rather wooly-minded, half-baked thinking on my part. I got it in my head that we needed explicit markers for indent/dedent. But that is actually wrong. One only needs explicit markers for the dedents.

The way Python works (according to Hoyle, or more accurately, according to Guido) is that a code block in Python begins with a colon ':' followed by a newline, followed by indentation. Visually, you know something is a nested code block because it is offset to the right. And you go back to the previous containing code block by dedenting back to the previous horizontal offset. A twist is that you can terminate multiple code blocks (the equivalent of having multiple } delimiters in a language like Java) by dedenting to any previous level of indentation. That is actually quite an elegant solution, but it really just does not work well for the problem of generating code. We need to be able to generate whatever Python code snippet (like from a ~~FreeMarker~~ Congo Templates macro) without all the fiddly indentation problems.
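To make that "twist" concrete, here is a small sketch that uses the standard library `tokenize` module to show what the regular Python tokenizer does with a single dedent that closes two blocks at once. The sample source and the `token_names` helper are just made up for illustration.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names the standard tokenizer produces."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


# Two nested blocks, both closed by dedenting straight back to column 0.
source = (
    "if a:\n"
    "    if b:\n"
    "        x = 1\n"
    "y = 2\n"
)

names = token_names(source)
print(names.count("INDENT"), names.count("DEDENT"))
```

The one visual dedent before `y = 2` turns into two separate DEDENT tokens, which is the Python equivalent of writing `}}` in Java.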

This problem, by the way, really isn't present when generating a typical language like Java (or CSharp) because we can just generate the code with willy-nilly indentation and then in a later pass, we just run it through a beautifier. But that doesn't quite work for Python, since incorrectly formatted Python code is actually invalid! (In other languages, it is just ugly, but not invalid.)
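A quick way to see the "invalid, not just ugly" point is to hand some sloppily indented source to the built-in `compile()`. This is only a sketch; the `is_valid_python` helper and the two sample snippets are invented for the example.

```python
def is_valid_python(source):
    """Return True if the source compiles, False on a syntax or indentation error."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# The same two statements inside an if block, once indented properly
# and once indented willy-nilly.
tidy = "if x:\n    a = 1\n    b = 2\n"
sloppy = "if x:\n        a = 1\n    b = 2\n"

print(is_valid_python(tidy), is_valid_python(sloppy))
```

The equivalent Java with the braces in place would compile either way, which is why a later beautifier pass is enough there.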

In any case, it finally dawned on me (I can be slow) that in our "hacked" Python syntax, we don't need any explicit indentation marker, because a code block is always started by a <COLON> followed by a <NEWLINE>. The reason that the indentation is necessary is so we can end the code block (sometimes multiple code blocks) by dedenting back to the appropriate horizontal offset. So that is where the need to keep track of indentation comes in. But we want to free ourselves of that -- at least when generating Python code from a template.

Or IOW, our design goal here is to allow the code (our "hacked" syntax) to work even if it is not "properly" indented (according to Guido.) This is because when you are generating Python code from templates, it is simply too difficult to get the indentation right. Here is what a macro looks like that generates the Python code to lookahead through a choice (A|B|C). The existing indent has to be physically passed in and each line has to be prefixed with the appropriate indentation.

Here is what the current version of that macro looks like using the newer "hacked" Python syntax. When we are in this "explicitdedentation" syntax, the parser is using the : \n combination to find the indent points, but it does not keep track of horizontal offsets in that mode. In fact, all of the horizontal whitespace is actually not meaningful. It could be indented willy-nilly and still work. The problem is that we now need to explicitly indicate the dedents. I finally switched from <<< to <- as the dedent marker. (I originally chose <<< because I figured we also needed >>>.)
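To illustrate the idea (and only the idea), here is a toy line-based sketch of turning explicit-dedent text into normally indented Python. It is not how the real thing works: the actual implementation parses the "hacked" syntax into an AST, whereas this just scans lines. It also assumes, purely for the example, that the `<-` marker sits on a line by itself and that any line ending in `:` opens a block, so it would be fooled by comments, strings, or bracketed expressions that span lines. The `reindent` function and the sample input are hypothetical.

```python
DEDENT_MARKER = "<-"  # the marker named above; its one-per-line placement is an assumption


def reindent(hacked, indent="    "):
    """Convert explicit-dedent text into conventionally indented Python."""
    depth = 0
    out = []
    for raw in hacked.splitlines():
        line = raw.strip()  # horizontal whitespace carries no meaning in this mode
        if not line:
            continue
        if line == DEDENT_MARKER:
            if depth == 0:
                raise ValueError("dedent marker with no open block")
            depth -= 1
            continue
        out.append(indent * depth + line)
        if line.endswith(":"):  # colon followed by newline opens a block
            depth += 1
    return "\n".join(out) + "\n"


# Deliberately indented willy-nilly.
hacked = """
def f(x):
        if x > 0:
  return 1
<-
return 0
<-
"""

print(reindent(hacked))
```

Each `<-` closes exactly one block, so closing two blocks at once takes two markers, much like two closing braces in Java.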

Anyway... that is where things are at.