The preprocessor in CongoCC is now a bit over 2 years old. (How time flies!)
I wrote the preprocessor even though I had no urgent need for it at the time. The reason is that I had started work on a C# grammar and, of course, the first thing that jumps out at one in that case is that, unlike any other language I had attacked so far, C# has a preprocessor. So I saw plenty of scenarios in which CongoCC (JavaCC 21 at the time) would need a preprocessor (or at least find one very nice to have), and my reasoning was that I would simply implement the preprocessor for C#, but in such a way that I could reuse it for the JavaCC/Congo grammars. I even figured that, if I could implement it in a very general way, then I could just make it available for any grammar writer.
However, none of that really worked out and the C# preprocessor and the one in CongoCC diverged into two separate things. I'll explain why shortly but first a bit of history. (I guess this is becoming a little sideline of mine, explaining the history of such things...)
Now, Java never included a preprocessor because it was considered that the C/C++ preprocessor had such potential for abuse that it was probably better to just say no. Well, I wouldn't be so presumptuous as to say that this was the wrong decision. But I would say that there are situations (maybe not so common, but they exist...) when a preprocessor (or at least some limited preprocessing capability) does come in handy.
So the team behind C# took a more moderate approach. They implemented a preprocessor in C#, but it was much more stripped down than the one from C++. Basically, all you can do is turn on/off regions of a file based on some symbols that you define. So you can do stuff like this:
#define PRODUCTION
#if DEBUG || TESTING
some code
#elif PRODUCTION
other code
#else
some other code
#endif
And not much else really. (In the above example, the only line that is turned on is the other code one.)
So, I wrote a rather straightforward implementation of this. That first implementation just went over the file (not even any tree-building needed) and produced a BitSet of the lines that were turned off.
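Just to illustrate the idea, here is a minimal sketch of that kind of line-switching pass in Java. To be clear, this is not the actual CongoCC code; it assumes well-formed input and the simplest possible conditions (a bare symbol after #if or #elif, no !, || or && expressions):

import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class LineSwitchSketch {
    // Returns a BitSet in which bit i is set if line i is switched off.
    static BitSet offLines(List<String> lines, Set<String> symbols) {
        BitSet off = new BitSet(lines.size());
        Deque<Boolean> region = new ArrayDeque<>(); // is each enclosing #if/#elif/#else branch on?
        Deque<Boolean> taken = new ArrayDeque<>();  // has some branch of the enclosing #if fired yet?
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            boolean on = !region.contains(false);
            if (line.startsWith("#define ")) {
                if (on) symbols.add(line.substring(8).trim());
            } else if (line.startsWith("#if ")) {
                boolean cond = symbols.contains(line.substring(4).trim());
                region.push(cond);
                taken.push(cond);
            } else if (line.startsWith("#elif ")) {
                boolean fire = symbols.contains(line.substring(6).trim()) && !taken.peek();
                region.pop();
                region.push(fire);
                if (fire) {
                    taken.pop();
                    taken.push(true);
                }
            } else if (line.startsWith("#else")) {
                boolean fire = !taken.peek();
                region.pop();
                region.push(fire);
                taken.pop();
                taken.push(true);
            } else if (line.startsWith("#endif")) {
                region.pop();
                taken.pop();
            } else if (!on) {
                off.set(i); // an ordinary line inside a switched-off region
            }
        }
        return off;
    }
}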
And that worked well enough. But I only realized something about this a year later, when I finally did get serious about working up the C# grammar:
This is not how the C# preprocessor works!
Once I put together a big test suite of some tens of thousands of C# source files in the wild, this came to my attention. I saw that in C#, you can comment out a preprocessor directive. For example, I would run into things like:
/*
#undef FOOBAR
*/
In the preprocessor logic I had implemented, the above could not work. My understanding of the preprocessor was that it was part of a prelexical step and, therefore, would not know about comments, which are handled in the later lexical processing.
Now, you could comment out a preprocessing directive with a single-line commenting style, i.e.
// #undef FOOBAR
That works because a preprocessor directive has to be the first non-whitespace on a line, so the #undef FOOBAR would not be a preprocessor directive, but simply part of the // #undef FOOBAR comment. But with a multiline comment, my understanding was that you couldn't comment out the preprocessor directive, as in the first of the two comments above.
Actually, not only does the C# preprocessor not really work like the one I implemented, here is another dirty secret:
The C# preprocessor is not really a preprocessor!
It's a misnomer. The C# preprocessor directives are parsed in the same parsing pass as the regular language parsing. So, with the preprocessor I wrote, it is perfectly okay to write:
/*
#if !TESTING
*/
#else
some extra testing code
*/
#endif
But in C# that is a total cockup! The first preprocessor line is commented out, and then the subsequent #else (and #endif) is invalid since it has no corresponding #if. This works okay with the preprocessor I had implemented, since the preprocessing directives are applied as a prelexical step that has no "understanding" of comments (or anything else) in the language being processed. (As I say, I was thinking in terms of having this be generally reusable.)
Isn't that all so very interesting? (I bet you could explain all this in some social setting and be the life of the party!)
So, anyway, the preprocessor I had implemented was not the C# preprocessor. It actually worked in many (probably the majority of) cases, but not always. So, I think it was in January 2022 that, after running into the above sorts of cases, I broke down, read the spec, and implemented the C# preprocessor. And I left the other preprocessor used in JavaCC21/Congo alone. So that's when the two diverged.
So, now that I've got that history lesson out of the way, here is where we are now. Congo does not use the C# preprocessor for its own internal purposes. What that means is that its preprocessor can take its own path of development.
And another thing that I am coming to realize: as we move towards full polyglot development, we probably need a more powerful preprocessor than what C# has anyway. So, IOW, it's time to turn our attention to the preprocessor somewhat and think about how it should work (not caring at all whether it is very similar to the C# one).
I think that one thing I'm going to add in fairly short order (unless somebody does it for me) is symbol substitution, so that we can do stuff like:
#if __csharp__
#define FOOBAR FooBar
#elif __python__
#define FOOBAR foo_bar
#endif
Things like that. (By the way, the current version automatically puts in a __lang__ preprocessor symbol, __java__ or __python__ or __csharp__. I anticipate having a bunch of standard things like this, but they'll all start and end with a double underscore.) But, anyway, as you see, we just do these symbol substitutions based on whatever condition, though most commonly on which language we're generating.
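To make that concrete, suppose some bit of code in a grammar reads as follows (the method call is purely made up for illustration):

someNode.FOOBAR();

With the definitions above, that line would come out as someNode.FooBar(); when generating C# and as someNode.foo_bar() when generating Python. Nothing fancy, just straight identifier substitution driven by which symbols are defined.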
To be honest, I am still trying to think through all the details. I don't currently anticipate having anything like full macro capabilities along the lines of FreeMarker, probably just simple identifier substitution really. As with any sort of preprocessor setup, there is the whole question of reporting error locations accurately.
Actually, at the moment, the machinery doesn't make any adjustment for unicode escaping, so it does report incorrect column numbers in error messages if you use things like \uA1B2 etc. in your file. I have had a mental TODO item of addressing this for a long time, but have not got round to it. But really, it's not a high priority. Now that people have unicode fonts, and practically any modern system uses UTF-8 encoding, I don't even know how much people use unicode escaping anyway. And, as a practical matter, as long as the line number is right, the column could be off a bit and, well.... Anyway, that's just me thinking out loud.

In the above provisional syntax, it wouldn't even be possible (and this could eventually be deliberate) to do a symbol substitution where the substitution is multiline. Of course, unless it is deliberately prevented, you could always put in a multiline string programmatically, as in:
preprocessorSymbols.put("foobar", "\nFour score and seven years ago\nOur forefathers...\n");
And then, with that kind of text expansion, there really would be the need to adjust line numbers to reflect what is in the source file.
In other matters, there is something in this vein that is bugging me a bit. With the transition to Congo, we stopped generating all those constructors that took Reader and InputStream. There are only two kinds of constructors generated: those that take a String (or stringy object, i.e. something that implements CharSequence) and those that take a java.nio.file.Path, to read off the file system basically.
What bugs me is that both constructors pass the input through this mungeContent routine that does the tabs/newlines/unicode-escape sorts of translations. I think this makes sense for the constructors that take a Path as a parameter. But I've been thinking that if you pass a String into the constructor, it probably makes more sense to have the notion that any stuff like that has already been done -- I mean to say, we take it that the application programmer has already normalized newlines, or whatever, assuming that is desired.
Certainly, when you pass in a String to the constructor, we should just assume that any unicode input escaping has already been done. And this is a sequence of characters already, so all the encoding related logic has been taken care of.
Actually, it would be tempting to only have the constructors that take a CharSequence, so we just (implicitly or explicitly) tell the application programmer to figure out encodings and all the rest of it: not our problem. BUT... in the end I think we need a constructor that takes a location on the file system, i.e. a java.nio.file.Path. It's just too common a use case. And in that case, I think we really do need to handle these annoying things.
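Roughly speaking, the division of labor I have in mind looks something like the following sketch. It is purely illustrative (the class name is invented and the real generated API differs), but it shows the idea: the Path constructor reads and munges, while the CharSequence constructor trusts the caller:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParserSketch {
    private final CharSequence content;

    // Caller passes characters directly: we assume newlines, tabs and any
    // unicode escapes have already been dealt with and use the input as-is.
    public ParserSketch(CharSequence input) {
        this.content = input;
    }

    // Reading off the file system: decode the bytes and apply the
    // tabs/newlines/unicode-escape translations (the mungeContent step).
    public ParserSketch(Path path) throws IOException {
        this.content = mungeContent(Files.readString(path));
    }

    private static CharSequence mungeContent(CharSequence raw) {
        // CR-LF to LF conversion, tab handling, unicode unescaping, etc. would live here
        return raw;
    }
}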
But anyway, I have been aware for some time that the constructors that take a String should probably (at least by default) just pass the chars through. I think so. The problem is that it's somewhat non-backward-compatible to make that change right now. But, then again, there is this general problem of applying this escaping (or actually, unescaping) multiple times. This is only very rarely a problem in practice. Certainly, once you've converted tabs to spaces, applying the same routine a second (superfluous) time doesn't matter, because there are no tabs left anyway; applying the logic multiple times is just inefficient. Same with CR-LF to LF and such. Unescaping unicode-escaped content a second time, though, is potentially a problem (though it's very much once-in-a-blue-moon sort of stuff).
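Here is the sort of once-in-a-blue-moon case I mean. Suppose the source file contains the literal text:

\u005Cu0041

A first unescaping pass converts the \u005C into a backslash character, which leaves \u0041 in the text. A second, superfluous pass then converts that into the single character A. So, unlike the tab and newline conversions, unicode unescaping is not idempotent: running it twice can give a different result than running it once.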
Well, I think I'll change it and make a note in the appropriate places that it's a bit backward-incompatible. But it's also more of a theoretical problem than a real practical one. So, the bottom line is that I'll probably make a few changes to the generated API (with respect to these constructors), but it's unlikely to affect anybody much in practice.