How to determine which first sets / follow sets will be used?

Is there any way to work out in advance of code generation which first sets and follow sets will be used in the generated code?

Well, the approach it takes is to generate everything and then, in a separate pass, use DeadCodeEliminator to delete whatever isn't used. Granted, that only works for Java code generation. For Python/C#, I guess it just generates everything and it all stays there.

(Note that the followSet concept is only used in the (still rather raw) fault-tolerant mode.)

To answer your question, though: I suppose it is possible to work out in advance which ones are used, but AFAIK (I could be wrong...) it's only Java that has this "code too large" problem, which was the main motivation for writing DeadCodeEliminator. Granted, generating a bunch of stuff that is unused is not a good thing in principle, but if you don't have the "code too large" problem, then it's not really a high priority to deal with that.

In principle, certainly for C# (and probably for Python too), it's no big deal to implement something like DeadCodeEliminator, but there is currently the issue that the core tool does not have access to a C# parser (or a Python parser). The JavaCCParser is functionally a superset of JavaParser, so it can parse a Java source file. In other words, to have the equivalent of DeadCodeEliminator for C#/Python, the javacc.jar file would have to include C#/Python parsing functionality -- which is probably where we'll end up eventually anyway. Ultimately, the parsers in examples/python and examples/csharp, prebuilt, might as well be included in javacc.jar (and javacc-full.jar).

    revusky but if you don't have the "code too large" problem, then it's not really a high priority to deal with that.

    What would be higher-priority items, in your view? It certainly makes the Python and C# generated code bloated for no particularly good reason. I'm not sure that hitting a compiler limitation would be the only reason to do something to avoid the bloat.

    I also don't believe the best approach for the other languages is to do a dead-code elimination step after generating the code - it just seems a sub-optimal way to go about it. (I don't think it's a good way for Java either, but I'm not suggesting that should be changed.)

      vsajip What would be higher-priority items, in your view?

      Well, documentation, generally speaking, but actually, I am inferring that you mean higher-priority in terms of hacking the code. (Obviously we need more and better organized documentation but that is kind of an orthogonal question, I guess.)

      I think that getting INJECT working for C#/Python would be quite high priority. On that, I think what is finally needed is to refactor the JavaCC grammar so that it is not a superset of the Java grammar but instead delegates the parsing of Java elements to a separate Java grammar. Then we would just use the same approach for handling code actions and injections in Python and C# and any other target language we eventually have.

      So, actually, the approach that ANTLR uses for dealing with code action snippets -- which is just to tokenize them and make no attempt to parse them -- might actually be right. Of course, in a separate pass, we would parse the various snippets, delegating that to the appropriate language parser, but the JavaCCParser itself would not do that. That's how I think it will work, but maybe only after a Congo rebranding.

      It certainly makes the Python and C# generated code bloated for no particularly good reason. I'm not sure that hitting a compiler limitation would be the only reason to do something to avoid the bloat.

      Well, I agree. When I say it's not high-priority, I don't mean that we will never address the problem. Realistically, I don't see that many people using the ability to generate parsers in Python/C# before the Congo rebranding. So maybe the thing is that we can just label certain issues as post-Congo and make a mental note that we'll address them after the rebranding. By the way, I don't even reject making some gesture toward keeping the people obsessed with Maven and versioning happy -- but we could agree to address that after the Congo rebranding as well.

      I also don't believe the best approach for the other languages is to do a dead-code elimination step after generating the code - it just seems a sub-optimal way to go about it. (I don't think it's a good way for Java either, but I'm not suggesting that should be changed.)

      I honestly don't understand why you don't like that approach. It has some advantages, you know. For example, you can just include in a template a whole bunch of routines that might conceivably be useful, and if they are never called, they aren't included in the generated class. (Obviously, this only applies to private variables/methods.)

      But then, once somebody uses one of those utility routines in a code action, it is not eliminated any more in the dead code elimination step. Maybe you balk at the machinery generating all these variables/methods only to eliminate them in a later step, like it has a funny smell about it or something. But maybe once you get used to the idea, it won't bother you!
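
      To make this concrete, here is a minimal sketch -- purely illustrative names, not actual JavaCC 21 output -- of the kind of thing a template can emit and what survives the dead-code pass:

          // Hypothetical shape of a generated parser class, before reaping.
          public class GeneratedParserSketch {

              // Referenced below, so DeadCodeEliminator keeps it.
              private String currentTokenImage() {
                  return "foo";
              }

              // Emitted "just in case" by a template but never called, so the
              // dead-code pass deletes it. Only private members can be dropped
              // safely this way, since nothing outside the class can see them.
              private int unusedUtility() {
                  return 42;
              }

              public void SomeProduction() {
                  System.out.println(currentTokenImage());
              }
          }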

        revusky
        I think I agree on this point. W.r.t. COBOL -> Java code generation, I looked at both alternatives and ended up doing the dead code elimination as a pass (called the Reaper in our case) between AST generation and Java code generation. I chose that for precisely your reason, Jon. To do it after Java generation would have required a considerably greater set of dependencies between code generation and the dead-code (post-)eliminator than between the dead-code (pre-)marker and the subsequent Java code generator. FWIW.

          adMartem

          Hmm, Reaper is a nice name. Maybe I'll use that. Actually, what JavaCC 21 does is perhaps even more brutal and bloody-minded than what you are describing. It sounds like your code generates an AST directly and then walks the tree and gets rid of unused elements. What we do is generate the Java source code, based on the various templates, with all the unused stuff in there; then, in a later step, we parse all that code to build the AST, walk the tree (like you do) to get rid of the unused stuff, and spit it back out as source code.

          One advantage of this that I didn't mention, actually, is that it exercises the overall system. We generate this whole API with the nodes and tree-walking and so on, and then we use it internally to do the "grim reaper" thang.
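
          Just to sketch the idea in code (this is not the real DeadCodeEliminator, and the node types below are made-up stand-ins for whatever the generated tree API actually exposes), the reaping step amounts to a walk over the parsed output looking for private members that nothing references:

              import java.util.HashSet;
              import java.util.List;
              import java.util.Set;

              // A rough sketch of a "reaper" pass over already-parsed generated code.
              class ReaperSketch {

                  // Stand-in node types; a real tree API has many more node kinds.
                  interface Node { List<Node> children(); }
                  interface MethodDecl extends Node { String name(); boolean isPrivate(); }
                  interface MethodCall extends Node { String calledName(); }

                  /** Names of private methods that nothing in the tree calls. */
                  static Set<String> unusedPrivateMethods(Node root,
                                                          List<MethodDecl> decls) {
                      Set<String> called = new HashSet<>();
                      collectCalls(root, called);
                      Set<String> unused = new HashSet<>();
                      for (MethodDecl decl : decls) {
                          if (decl.isPrivate() && !called.contains(decl.name())) {
                              unused.add(decl.name());
                          }
                      }
                      // The real pass deletes these declarations from the tree and
                      // re-emits the source; repeating until nothing more falls out
                      // handles helpers that were only called by other dead helpers.
                      return unused;
                  }

                  static void collectCalls(Node node, Set<String> called) {
                      if (node instanceof MethodCall) {
                          called.add(((MethodCall) node).calledName());
                      }
                      for (Node child : node.children()) {
                          collectCalls(child, called);
                      }
                  }
              }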

          Oh, by the way, I think I figured out why the "code too large" thing was still rearing its head. Not 100% sure if this was your problem. Look here: https://github.com/javacc21/javacc21/blob/master/src/ftl/java/LookaheadRoutines.java.ftl#L500 and here: https://github.com/javacc21/javacc21/blob/master/src/ftl/java/ParserProductions.java.ftl#L533
          and here actually: https://github.com/javacc21/javacc21/blob/master/src/ftl/java/CommonUtils.java.ftl#L37

          Basically, I define a variable called USE_FIRST_SET_THRESHOLD, and the generated code only uses an XXX_FIRST_SET variable if the number of token types to check is that amount or greater. So, the way things are by default, the reaper will typically reap the XXX_FIRST_SET variables that cover fewer than 5 token types. BUT... the threshold can be adjusted manually in the template. If there is great demand for it, I could even make it a settable option in the grammar file. So, anyway, setting that to a higher number could well get rid of some residual "code too large" problems -- at least if they stem from generating some huge number of first set variables. As for why I set it to 5 by default, well, I dunno, that's just an arbitrary choice. I don't think it makes much of a performance difference in that range, like if you changed it to 4 or 6 or something.
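
          To illustrate the threshold (again with made-up names; the real generated identifiers look different), the effect is roughly this -- below the threshold the check is inlined and no first-set constant is ever referenced, so the reaper removes it; at or above the threshold the shared constant is used and therefore kept:

              import java.util.EnumSet;

              // Illustrative only: the trade-off that USE_FIRST_SET_THRESHOLD controls.
              public class FirstSetThresholdSketch {

                  enum TokenType { IDENTIFIER, LPAREN, THIS, SUPER, NEW, MINUS }

                  // Six candidate token types (>= the default threshold of 5), so the
                  // generated check goes through a shared first-set constant, which is
                  // therefore referenced and survives the reaper.
                  static final EnumSet<TokenType> EXPRESSION_FIRST_SET =
                      EnumSet.of(TokenType.IDENTIFIER, TokenType.LPAREN, TokenType.THIS,
                                 TokenType.SUPER, TokenType.NEW, TokenType.MINUS);

                  static boolean startsExpression(TokenType next) {
                      return EXPRESSION_FIRST_SET.contains(next);
                  }

                  // Only three candidates (below the threshold), so the comparison is
                  // inlined; the corresponding first-set constant is never referenced
                  // and the reaper deletes it from the generated class.
                  static boolean startsPrimary(TokenType next) {
                      return next == TokenType.IDENTIFIER
                          || next == TokenType.THIS
                          || next == TokenType.LPAREN;
                  }

                  public static void main(String[] args) {
                      System.out.println(startsExpression(TokenType.NEW));  // true
                      System.out.println(startsPrimary(TokenType.SUPER));   // false
                  }
              }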


          The problem with the reaper approach is that it requires a lot of extra work when considering generated languages other than Java - you'd need to develop dead code elimination software for each generated language. This is quite a lot more than just having a parser for the language (which of course we do for Python and C#). I'm also not sure that dead code elimination can be done reliably for dynamic target languages like Python or JavaScript, where a function or method name can be constructed at runtime and then used to call that function/method (a not uncommon Python idiom).

            vsajip
            For what it's worth, in the analogous COBOL -> Java case, the complexity of the post-Java-generation pattern was that the COBOL language statements, paragraphs, etc. being compiled determined things like the accessors and even entire classes that were generated. To then go back and prune the Java elements would have required retaining the decisions that resulted in those disparate, and often overlapping, elements.

            Whether this is at all applicable to this discussion is questionable, but I just wanted to throw it in the mix.
            I can certainly see your point w.r.t. non-Java parsers based on the current JavaCC architecture.

            7 days later

            Sorry to have left this unanswered for a week. I was quite absorbed with other things -- fixing the lookahead logic, specifically.

            vsajip The problem with the reaper approach is that it requires a lot of extra work when considering generated languages other than Java - you'd need to develop dead code elimination software for each generated language

            Well, yes, the reaper approach (as applied to the code output in the target language) does require an implementation for each target language. However, saying that this is a lot of extra work could be an exaggeration. DeadCodeEliminator.java is not really a lot of code -- something like 100 lines. And it's not like we're adding a new target language that often!

            But, I guess all of this kind of relates to a basic issue. You see, I've always basically assumed that, for each target language that we support -- in principle, as an equal citizen with Java and every other language -- we will have our own parser for that language. We can build a tree and so forth, in particular to implement code actions and INJECT, for example. Now, that actually contrasts with the ANTLR approach, where they have "targets" for all kinds of languages -- though I would bet that all but a handful are hardly ever used in practice. But my point is that ANTLR has no "knowledge" internally of the structure of the languages it can generate code for. As best I understand, creating a new "target" for ANTLR is solely a matter of creating a new set of templates. Well, it's true that we could take that approach ourselves, but I've always figured that we could get a higher quality result if the tool has internal smarts about the various languages that it generates code in.

            So, what I'm getting at is that IF you assume that each target language has its own grammar/parser and we can actually build ASTs for the various constructs, like code blocks and expressions, etc., then the incremental work of having a "reaper" is really not very much. I mean, for example, that C# grammar was a real bear to write. I guess it took me a month (about mid-January to mid-February), but I was in full-blown obsessive mode -- it was well over a normal man-month of work, and a not-so-mythical man-month in any case! But my point is that if I spent several hundred man-hours getting the C# grammar into the shape it's in, it is doubtful that a C# reaper class would take more than a few hours. (And it would have the side effect of causing me -- or you, if you are game for it -- to get a better degree of familiarity with the AST that the grammar produces. There is the "eating one's own dog food" aspect.) But regardless, the incremental work of a reaper class, above and beyond getting the grammar/parser working robustly, is really not much.

            Now, it's true that if you did this entirely on the JavaCC/Congo AST, i.e. the syntactic tree of the grammar -- running over the various nodes and deducing which expansions require a scanahead method and/or first_set variable -- you would only need to write that once and, in principle, it would work for every target language thereafter. That's true, and the work involved is probably not terribly different in scale from writing the reaper for a given target language.

            But, of course, when I first wrote the reaper code for Java, having any other target language seemed like such a pipe dream that it really made fairly little difference. With only one target language, it was bound to be about the same work one way or the other. And I guess I became enamoured of the idea of just generating a whole bunch of code that was possibly not used and then pruning it in a later step.

            Now, as to your question -- how to determine which first sets / follow sets will be used -- well, I guess the core logic is not so different from how you determine which variables/methods are actually used in the output code. You run through recursively and build up the list of which productions are actually scanned in a lookahead; but of course, those can in turn contain nested lookaheads that scan into other productions, and so on. I guess you just build up the set of expansions that require a lookahead method or whatever, and if something is not in that set, then there is no need to build the lookahead method. (Or the first set, if we're only going to scan one token in.)
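
            In code, that recursive bookkeeping might look something like the sketch below (hypothetical types standing in for the grammar's own AST; not actual JavaCC/Congo internals):

                import java.util.HashSet;
                import java.util.List;
                import java.util.Set;

                // Sketch of working out, ahead of code generation, which productions
                // need a scanahead routine (and hence a first-set variable) at all.
                class LookaheadNeedsSketch {

                    // Hypothetical view of a grammar production.
                    interface Production {
                        String name();
                        List<Production> lookaheadTargets();  // scanned by a lookahead
                        List<Production> expansionTargets();  // expanded normally
                    }

                    /** Productions reachable from some lookahead, directly or nested. */
                    static Set<String> productionsNeedingScanahead(List<Production> all) {
                        Set<String> needed = new HashSet<>();
                        for (Production p : all) {
                            for (Production target : p.lookaheadTargets()) {
                                mark(target, needed);
                            }
                        }
                        return needed;
                    }

                    // Once a production is scanned, everything it expands into is
                    // scanned too, and lookaheads nested inside those expansions pull
                    // in still more productions -- hence recursing over both edges.
                    static void mark(Production p, Set<String> needed) {
                        if (!needed.add(p.name())) {
                            return;  // already visited; avoids looping on cycles
                        }
                        for (Production t : p.expansionTargets()) {
                            mark(t, needed);
                        }
                        for (Production t : p.lookaheadTargets()) {
                            mark(t, needed);
                        }
                    }
                }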

            Well, I guess if this whole issue of the redundant code being generated bothers you a lot, then go ahead and work on it. To tell the truth, for me, the benefit of you doing that would not be so much that we would not be generating all that dead code (which I still think, on balance, is not a big deal at this stage of things) but would more be that you just end up deepening your understanding of the core machinery. So, by all means, go right ahead!

            4 days later

            Oh, by the way, I patched the "reaper" to get rid of unused imports as well.

            See here. (If you're interested...)

            a month later

            Well, one or more of these changes solved my "code too large" problem in fault-tolerant mode! I am excited to get rid of my ad hoc "find the next token that might begin something recognizable" and replace it with re-sync(s).