Sorry to have left this unanswered for a week. I was quite absorbed with other things -- fixing the lookahead logic, specifically.
vsajip: The problem with the reaper approach is that it requires a lot of extra work when considering generated languages other than Java -- you'd need to develop dead code elimination software for each generated language.
Well, yes, the reaper approach (as applied to the code output in the target language) does require an implementation for each target language. However, calling that a lot of extra work could be an exaggeration. DeadCodeEliminator.java is not really a lot of code -- about 100 lines. And it's not as if we're adding a new target language that often!
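For the record, the core of such a reaper is essentially a mark-and-sweep over the generated code: start from the public entry points, mark everything reachable through the call graph, and delete the rest. Here is a minimal sketch of that idea -- the class name, method names, and the call-graph representation are made up for illustration, not the actual DeadCodeEliminator.java API (which works over the parsed Java AST rather than a prebuilt map):

```java
import java.util.*;

// Illustrative sketch of a per-target "reaper". In a real implementation
// the call graph would be extracted from the target language's AST.
public class ReaperSketch {
    // callGraph: method name -> names of the methods it calls.
    // Returns the set of live methods; anything not in the result
    // can be pruned from the generated output.
    public static Set<String> reap(Map<String, Set<String>> callGraph,
                                   Set<String> entryPoints) {
        Set<String> live = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(entryPoints);
        while (!work.isEmpty()) {
            String m = work.pop();
            if (live.add(m)) {
                // Newly marked live: everything it calls is live too.
                work.addAll(callGraph.getOrDefault(m, Set.of()));
            }
        }
        return live;
    }
}
```

The per-language part is only the extraction of the call graph from that language's AST; the marking logic itself is the same trivial traversal everywhere, which is why the class stays around 100 lines.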
But I guess all of this relates to a basic issue. You see, I've always assumed that, for each target language we support -- in principle, as an equal citizen with Java and every other language -- we will have our own parser for that language. We can build a tree and so forth; in particular, to implement code actions and INJECT, for example. Now, that actually contrasts with the ANTLR approach, where they have "targets" for all kinds of languages -- though I would bet that all but a handful are hardly ever used in practice. But my point is that ANTLR has no "knowledge" internally of the structure of the languages it can generate code for. As best I understand, creating a new "target" for ANTLR is solely a matter of creating a new set of templates. Well, it's true that we could take that approach ourselves, but I've always figured that we could get a higher-quality result if the tool has internal smarts about the various languages it generates code in.
So, what I'm getting at is that IF you assume that each target language has its own grammar/parser, and we can actually build ASTs for the various constructs, like code blocks and expressions and so on, then the incremental work of having a "reaper" is really not very much. I mean, for example, the C# grammar was a real bear to write. I guess it took me a month (about mid-January to mid-February) but I was in full-blown obsessive mode. It was well over a normal man-month of work -- and a not-so-mythical man-month in any case! But my point is that if I spent several hundred man-hours getting the C# grammar into the shape it's in, it is doubtful that a C# reaper class would take more than a few hours. (And it would have the side effect of causing me (or you, if you are game for it) to get a better degree of familiarity with the AST that the grammar produces. There is the "eating one's own dog food" aspect.) But regardless, the incremental work of a reaper class, above and beyond getting the grammar/parser working robustly, is really not much.
Now, it's true that if you did this entirely on the JavaCC/Congo AST, i.e. the syntax tree of the grammar, running over the various nodes and deducing which expansions require a scanahead method and/or first_set variable, you would only need to write that once and, in principle, it would work for every target language thereafter. That's true, and the work that involves is probably not terribly different in scale from writing the reaper for a given target language.
But, of course, when I first wrote the reaper code for Java, having any other target language seemed like such a pipe dream that it really made little difference. With only one target language, it was bound to be about the same amount of work one way or the other. And I guess I became enamoured of the idea of just generating a whole bunch of code that was possibly not used and then pruning it in a later step.
Now, as to your question of how to determine which first sets / follow sets will be used: well, I guess the core logic is not so different from how you determine which variables/methods are actually used in the output code. You run through recursively and build up the list of which productions are actually scanned in a lookahead -- but of course, those can in turn contain nested lookaheads that scan into other productions, and so on. I guess you just build up the set of expansions that require a lookahead method (or whatever) and, if something is not in that set, then there is no need to build the lookahead method. (Or the first set, if we're only going to scan one token in.)
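To make that concrete, here is a rough sketch of the traversal just described -- the names and data shapes are hypothetical stand-ins, not the real internal API. Starting from the expansions that carry an explicit lookahead, we recursively pull in every production that a lookahead scans into, nested lookaheads included; only productions that land in the resulting set need a generated scanahead method (or just a first set, in the single-token case):

```java
import java.util.*;

// Hypothetical sketch of the grammar-level analysis. "directScans" maps a
// production to the productions a lookahead would scan into when checking it.
public class LookaheadNeedsSketch {
    // lookaheadRoots: productions scanned by an explicit lookahead somewhere
    // in the grammar. Returns every production for which a lookahead routine
    // (scanahead method or first set) must actually be generated.
    public static Set<String> needsLookahead(Set<String> lookaheadRoots,
                                             Map<String, Set<String>> directScans) {
        Set<String> needed = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(lookaheadRoots);
        while (!work.isEmpty()) {
            String p = work.pop();
            if (needed.add(p)) {
                // Nested lookaheads: scanning p means scanning into
                // whatever p itself scans into, transitively.
                work.addAll(directScans.getOrDefault(p, Set.of()));
            }
        }
        return needed;
    }
}
```

Done this way, on the grammar AST rather than the generated output, the analysis is written once and every target language benefits -- which is exactly the trade-off against the per-target reaper discussed above.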
Well, I guess if this whole issue of the redundant code being generated bothers you a lot, then go ahead and work on it. To tell the truth, for me, the benefit of your doing that would not be so much that we would stop generating all that dead code (which I still think, on balance, is not a big deal at this stage of things) but more that you would end up deepening your understanding of the core machinery. So, by all means, go right ahead!