The New PyWim format!

revusky

Hello, everyone.

I just created a new format for Python code that I anticipate we will generate from templates. I dub this the "pywim" format, where "wim" stands for With Indentation Markers. The "wim" part also sounds like the first syllable of "whimsical", but this is really not whimsical at all.

The basic problem is that the very thing that distinguishes Python's syntax, namely that indentation (and dedentation) is actually meaningful, makes it very, very hard to deal with the templating problem, i.e. generating valid Python code from templates. In a "normal" language like Java, the solution is simple: your template can generate code that is indented any which way, and then, in a separate pass, you just parse that code (since it is valid and thus parseable, after all), walk the AST, and generate properly formatted code. That does not work for Python because our Python parser (like any Python parser) will refuse to parse code that is not properly indented!
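
To illustrate the point about parsers refusing badly indented input, here is a minimal sketch. It uses CPython's standard `ast` module as a stand-in for our own parser (that substitution is an assumption for illustration only), and the helper name `parses` is made up for the example.

```python
import ast


def parses(source: str) -> bool:
    """Return True if the source can be parsed as regular Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# The same statements, once indented consistently and once indented
# the way a readable-as-a-template template might emit them.
well_formed = "def f(x):\n    y = x + 1\n    return y\n"
sloppy = "def f(x):\n        y = x + 1\n    return y\n"

print(parses(well_formed))
print(parses(sloppy))
```

The second string fails before any AST exists, so there is nothing to walk and reformat in a later pass.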

So, you see, the problem is that if one is working on a template, one has a natural tendency to indent it so that it is readable as a template, but that is frequently not compatible with the template generating properly indented Python source code. So, the solution I have in mind is pywim, which is an alternative representation for Python that works pretty much like any "normal" (LOL) language -- "normal" in the sense that indentation is not meaningful. Actually, in most programming languages, neither horizontal whitespace nor newlines are meaningful. In pywim, horizontal whitespace is ignored, but newlines work just like they do in regular Python. In regular Python (and pywim), a newline that ends a code line is meaningful, while any newlines that just create superfluous vertical space are not.

But, anyway, the important point is that in pywim, you have explicit indent/dedent tokens which are >>> and <<< and need to be there to indicate the indentation.

So here is an example to give you a sense of what I'm talking about:

   # Some pywim
def check_intervals(ranges, ch):
>>>    index = bisect.bisect_left(ranges, ch)
 # The following are not indented properly but 
 # it doesn't matter! The parser takes its cue from
 # the indentation _markers_, not the actual indentation
  n = len(ranges)
          if index < n:
>>>  if index % 2 == 0:
>>>            if index < (n - 1):
>>> return ranges[index] <= ch <= ranges[index + 1]
<<< <<<     
elif index > 0:
>>>          return ranges[index - 1] <= ch <= ranges[index]
<<<       return False 
  <<<<<<
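
To make the marker idea concrete, here is a rough sketch of turning pywim-style text into regular indented Python. This is not how our parser does it (the parser consumes the markers as tokens); it is a line-based approximation that assumes markers only appear at the start of a line, and the name `pywim_to_python` is invented for this example.

```python
def pywim_to_python(source: str, indent: str = "    ") -> str:
    """Re-indent pywim-style text using its >>> / <<< markers.

    Assumption: markers appear only at the start of a line (after
    optional whitespace). Multi-line strings are not handled.
    """
    depth = 0
    out = []
    for raw in source.splitlines():
        line = raw.strip()  # the actual indentation is ignored
        # Consume leading markers, adjusting the depth as we go.
        while line.startswith((">>>", "<<<")):
            depth += 1 if line.startswith(">>>") else -1
            if depth < 0:
                raise ValueError("unbalanced <<< marker")
            line = line[3:].lstrip()
        if line:  # skip blank lines and marker-only lines
            out.append(indent * depth + line)
    return "\n".join(out) + "\n"


sample = """
def sign(x):
>>> if x < 0:
      >>> return -1
<<< return 1
<<<
"""

print(pywim_to_python(sample))
```

The output is ordinary Python with four-space indentation, regardless of how the input lines were actually indented.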

What I anticipate in short order is that a template such as lexer.py.ftl (which is now lexer.py.ctl, actually) will become lexer.pywim.ctl and will generate pywim code. Then we can parse that into an AST and do the stuff we want to do, like reaping unused variables and things like that. The final step is to spit out the actual Python: we just walk the AST and generate standard Python code.

As things stand now, the Java test harness for the Python parser parses an input file as pywim if the extension is .pywim. (For some reason, the non-Java tests are broken now, and I honestly don't know why. None of the aforementioned changes should break anything, but maybe somebody will look into this. (Maybe somebody whose initials are VS.))

And the result of this operation will be that one will be able to work on the python templates without this constant fear that adding or removing a (seemingly) extraneous space (or tab) will break the template!

revusky

Actually, after hacking on this a bit more, it occurs to me that there is no need for a separate .pywim extension. The way it is now implemented is that you can just put:

# pywim:on

or:

 # pywim:off

in a Python source file and our parser will behave appropriately, turning the "pywim" mode on or off as instructed. The reason for this is that we may want to convert templates gradually. And, of course, when the entire template is converted to generating "pywim" code, all we need is to have # pywim:on at the very top of the generated file and the parser will treat it that way. The likelihood of somebody having this comment in their file without being aware of this feature is, to all intents and purposes, zero. It is like a text file in the wild having a "byte order mark", a.k.a. a BOM (used to indicate the character encoding), at the beginning of the file purely by accident. So, the approach I took in slurping in text files is to always check for the BOM, in case it is there. And again, the chance that it is there by accident is not really worth considering, as far as I can see!
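
For anyone who wants to picture the on/off switching, here is a minimal sketch of the idea. The real logic lives inside the parser; the regular expression, the tolerance for extra whitespace, and the name `tag_lines` are all assumptions made up for this illustration. It also drops a leading BOM, in the same "check for it, just in case" spirit.

```python
import re

# Hypothetical matcher for the directive comments; how strictly the
# real parser matches them is not specified here.
DIRECTIVE = re.compile(r"^\s*#\s*pywim:(on|off)\s*$")


def tag_lines(text: str):
    """Return (pywim_mode, line) pairs; the mode starts out off."""
    if text.startswith("\ufeff"):  # drop a BOM if one is present
        text = text[1:]
    mode = False
    tagged = []
    for line in text.splitlines():
        match = DIRECTIVE.match(line)
        if match:
            mode = match.group(1) == "on"
            continue  # the directive itself is not code
        tagged.append((mode, line))
    return tagged


src = "\ufeffx = 1\n# pywim:on\nif x:\n>>> y = 2\n<<<\n # pywim:off\nz = 3\n"
for mode, line in tag_lines(src):
    print("pywim " if mode else "python", line)
```

Lines between the two directives are tagged as pywim and everything else as regular Python, which is what lets a template be converted piece by piece.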

revusky

Actually, I may just delete this thread because, as a result of some further hacking (and also thinking), I realized that my approach was misguided and I tore it up and redid it!

But, that said, it is really quite necessary to do this. The whole existing approach of passing an indent integer around all over the place makes the thing so fragile and fiddly that it needed to be addressed. So, stay tuned for a new post on "Hacking the Python Syntax".