[ANN] Syntax Converter Available

revusky

There is now a syntax converter to convert legacy JavaCC (and some things in JavaCC 21) to the approved CongoCC syntax.

You can execute it as:

java -jar javacc-full.jar convert filename

That will spit it out to stdout. More likely, you want something like:

java -jar javacc-full.jar convert filename > outputfilename

The converter converts most of the gnarly things in legacy JavaCC. All those obligatory (in legacy JavaCC, but not in JavaCC 21) empty braces get eliminated. The pointless empty parentheses all over the place get eliminated as well. The legacy LOOKAHEAD construct is replaced by SCAN. Since JavaCC 21 (and Congo of course) do not require any void return type for a production, those get eliminated.

The converter also handles PARSER_CODE_DECLS and TOKEN_MGR_DECLS, converting them to INJECT statements. It also converts constructs that existed in JavaCC 21 but may not exist in Congo. For example:

      => Foo Bar Baz

now gets changed to:

      Foo Bar Baz =>||

The conversion of LOOKAHEAD to SCAN is not perfect. It converts:

     LOOKAHEAD(Foo()) Foo()

to:

     SCAN Foo => Foo

but it should really convert it to:

     Foo =>||

Likewise, it converts:

    LOOKAHEAD(Foo() Bar()) Foo() Bar() Baz()

to:

    SCAN Foo Bar => Foo Bar Baz

but it should really convert it to:

     Foo Bar =>|| Baz

(I'll get that going at some point.)

I tried it on various legacy JavaCC grammars in the wild and it basically seems to work. So, there it is. All feedback is welcome.

adMartem

revusky
Shouldn't the LOOKAHEAD([...,]{...}) convert to SCAN {...}# [...] => in order to retain more consistent function? I had quite a few of these cases where the semantic lookahead was "gating" a dialect-dependent choice. In fact I believe the legacy JavaCC also evaluates the semantic lookahead after the syntactic lookahead (which I remember I was shocked by when I discovered it), but I doubt that is depended on very often. In any case, maybe a well-placed ASSERT could be used to achieve true functional equivalence if a semantic lookahead is used in addition to syntactic lookahead.

revusky

adMartem

Shouldn't the LOOKAHEAD([...,]{...}) convert to SCAN {...}# [...] => in order to retain more consistent function?

Yes, you're right. 😅

I changed it, but I have to look at this carefully because, to be honest, I'm not 100% sure how it should really be, in general.

In legacy JavaCC, if you write:

  LOOKAHEAD({someCondition()}) Expansion()

if the condition returns true, then it enters the expansion without any actual lookahead. Not even one token, which is the default usually if nothing is specified. But not here. Or, in other words, the above is the same as:

 LOOKAHEAD(0, {someCondition()}) Expansion()

I'm not sure that the way JavaCC21 handles this stuff is so much better. But, if you want the above, as things stand, you would have to put in the zero explicitly, i.e.:

  SCAN 0 {someCondition()} => Expansion

Currently, the logic is that a SCAN with no numerical or syntactic lookahead just scans to the end of the expansion. This is because, at some point earlier, before I even came up with the up-to-here concept, I decided that:

  SCAN Foo Bar Baz

would be equivalent to:

  LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz()

Or, in other words, a lone SCAN in front of the expansion meant scanning to the end of the expansion. But now that there is a separate way of expressing it, i.e. Foo Bar Baz =>||, I am not sure I want to continue to support the older SCAN Foo Bar Baz, which is why the syntax converter now rewrites those things.

I guess, generally speaking, in the move to the Congo branding, there is a one-time opportunity to get this stuff right. Quite possibly the better approach is that:

 SCAN {someCondition()} => Expansion

should mean that it checks the condition AND checks the default single-token lookahead. Or maaaybeeee go back to the notion that if you express a semantic lookahead with no numerical or syntactic lookahead, we just check the condition and have zero tokens of lookahead, which is the legacy JavaCC behavior.

I'm a bit worried about the above causing an indefinite lookahead, because it could be a major gotcha, where people write these very expensive indefinite lookaheads without intending to do so. So that is an open question to throw out there.
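
To make the three candidate behaviours easier to compare, here is a rough Java sketch. It is only a toy model and not how the tool is implemented: the class and method names are made up for illustration, an expansion is reduced to a flat list of expected token types, and the order in which the condition and the token check are evaluated is an assumption.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Toy model only: names are hypothetical and an "expansion" is just a flat
// list of expected token types, which is far simpler than a real grammar.
class SemanticLookaheadSketch {

    // True if the first `count` expected token types match the upcoming tokens.
    static boolean matchesPrefix(List<String> upcoming, List<String> expected, int count) {
        if (count > expected.size() || count > upcoming.size()) {
            return false;
        }
        return upcoming.subList(0, count).equals(expected.subList(0, count));
    }

    // Legacy JavaCC behaviour as described above: only the condition is
    // checked, with zero tokens of lookahead.
    static boolean conditionOnly(BooleanSupplier condition, List<String> upcoming, List<String> expected) {
        return condition.getAsBoolean();
    }

    // Behaviour described as current: the condition plus a scan to the end
    // of the expansion.
    static boolean conditionPlusScanToEnd(BooleanSupplier condition, List<String> upcoming, List<String> expected) {
        return condition.getAsBoolean() && matchesPrefix(upcoming, expected, expected.size());
    }

    // Alternative floated above: the condition plus the default single-token check.
    static boolean conditionPlusOneToken(BooleanSupplier condition, List<String> upcoming, List<String> expected) {
        return condition.getAsBoolean() && matchesPrefix(upcoming, expected, Math.min(1, expected.size()));
    }
}
```

With upcoming tokens A X and an expansion expecting A B, a true condition is enough for the first and third variants to enter the expansion, while the scan-to-end variant rejects it. That difference is also where the cost concern comes from, since the scan-to-end variant looks at as many tokens as the expansion is long.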

Though, all that said, this situation with the legacy JavaCC where semantic lookahead does apply in lookaheads but syntactic lookahead does not -- I wonder if anybody ever tried to justify that on theoretical grounds of some sort? Well, the reason it's like that is because, on their first-pass proof-of-concept implementation of this stuff back around 1996/1997, this was how it was implemented and it was never revisited!

Well, anyway, you're right that there is quite a bit of overlap between ASSERT and using lookaheads, i.e. SCAN. And maybe there is a historical opportunity to revisit this and get things right (or far closer to right) with a rebranding to CongoCC. And, of course, there is the possibility of introducing new constructs that work the way we prefer and using new keywords, like CHECK and/or VERIFY, say. And thus leaving SCAN and ASSERT working as before...

Well, the above includes some rather wooly-minded (at the moment) thoughts...

adMartem

Also, I was curious as to why it converts:

IOControlClause : 
[<COMMACHAR>] =>|+1
(
    RerunClause
  | SameAreaClause
  | MultipleFileClause
 ) [<COMMACHAR>]
;

into

IOControlClause : [<COMMACHAR>] =>|+1
(
    RerunClause
  | SameAreaClause
  | MultipleFileClause
 ) [<COMMACHAR>]
;

adMartem

adMartem
I also noticed things like

;

// ROLLBACK statement production //
 
void RollbackStatement : 
{}
{

get converted to

; RollbackStatement :

Note the completely removed comment, and the other whitespace after the preceding ";". It seems to be related to the removal of the "void" return type. This doesn't happen if there is no return type, as far as I've noticed.

revusky

adMartem

Yeah, well, this stuff is clearly a bit buggy in spots.

Regarding the whitespace, I was wondering recently whether a better approach might be to just not worry much about whitespace and then in a second pass run the thing through a pretty-printer that indents the file according to some conventions.

I mean, I have noticed that a lot of legacy JavaCC grammars indent the code in very strange and inconsistent ways. I guess I decided at some point that the syntax converter wouldn't be "opinionated" and would just basically try to preserve the original grammar's formatting. But now I'm wavering on that and thinking that maybe the tool should, given the chance, just indent the output consistently. After all, legacy JavaCC grammars are hard enough to read as it is, without the additional problem of them being formatted in a chaotic manner. Besides, a lot of people using the syntax converter would be people who just "inherited" some old JavaCC grammar as part of a codebase and want to clean it up, so they would appreciate the tool reformatting the thing more sensibly.
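
For what it's worth, a second-pass reindenter could start out as something very naive. The Java sketch below is a hypothetical illustration, not part of the converter: it indents each line by the current bracket nesting depth, and it assumes that brackets inside string literals and comments can be ignored, which a real pretty-printer could not get away with.

```java
// Hypothetical sketch of a naive second-pass reindenter. It does not
// tokenize the grammar, so brackets inside strings or comments would
// throw off the depth count.
class ReindentSketch {

    static String reindent(String source, int indentWidth) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            // A line that begins with a closing bracket is printed one level out.
            boolean startsWithCloser = !line.isEmpty() && ")}]".indexOf(line.charAt(0)) >= 0;
            int level = Math.max(0, startsWithCloser ? depth - 1 : depth);
            if (!line.isEmpty()) {
                for (int s = 0; s < level * indentWidth; s++) {
                    out.append(' ');
                }
                out.append(line);
            }
            if (i < lines.length - 1) {
                out.append('\n');
            }
            // Update the nesting depth from the brackets on this line.
            for (char c : line.toCharArray()) {
                if ("({[".indexOf(c) >= 0) {
                    depth++;
                } else if (")}]".indexOf(c) >= 0) {
                    depth = Math.max(0, depth - 1);
                }
            }
        }
        return out.toString();
    }
}
```

Running the converter output through something like reindent(text, 4) would at least make the nesting consistent, whatever the original grammar did.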

Of course, all that said, eating the comment is just a clear bug OTOH. But... I would add that, for now, I don't think I'm going to allocate any more effort to this syntax converter. I think it is good enough to be useful and there are so many things for me to turn my attention to now that... Heck, if you (or anybody) wants to muck with this, by all means... But, you know, I could get all obsessive about getting it really perfect, but I have to exert some self-discipline in terms of allocating my time.

Well, and besides that, I still haven't given up on the idea of attracting some collaborators, so arguably, leaving some low-hanging fruit here and there that people could come in and work on, in large part, just to gain familiarity with the overall system -- that could actually be a good thing.

adMartem

And,

void CaptureArchaicComment(Token t) :
{}
{
    {       
        captureArchaicComment(t);
    }
}

converts to

CaptureArchaicComment(Token t) :
    {       
        captureArchaicComment(t);
    }
;

Which now results in:

Exception in thread "main" com.javacc.parser.ParseException: 
Encountered an error at (or somewhere around) /Users/johnbradley/Development/local-repositories/p3cobol/src/main/congocc/p3cobol.ccc:3214:1
Was expecting one of the following:
_FAIL, _UNCACHE_TOKENS, _ACTIVE_TOKENS, _ACTIVATE_TOKENS, _DEACTIVATE_TOKENS, _ENSURE, _ATTEMPT, _LEXICAL_STATE, TRY, STRING_LITERAL, LPAREN, LBRACE, LBRACKET, LT, IDENTIFIER
Found string ";" of type SEMICOLON
	at JavaCCParser.ExpansionUnit(src/javacc/JavaCC.javacc:1777)
	at com.javacc.parser.JavaCCParser.ExpansionUnit(JavaCCParser.java:12823)
	at JavaCCParser.ExpansionSequence(src/javacc/JavaCC.javacc:1535)
	at com.javacc.parser.JavaCCParser.ExpansionSequence(JavaCCParser.java:11974)
	at JavaCCParser.ExpansionChoice(src/javacc/JavaCC.javacc:1445)
	at com.javacc.parser.JavaCCParser.ExpansionChoice(JavaCCParser.java:11779)
	at JavaCCParser.BNFProduction(src/javacc/JavaCC.javacc:1166)
	at com.javacc.parser.JavaCCParser.BNFProduction(JavaCCParser.java:11220)
	at JavaCCParser.Root(src/javacc/JavaCC.javacc:888)
	at com.javacc.parser.JavaCCParser.Root(JavaCCParser.java:10248)

(There was no JavaCC error before the conversion.)

revusky

adMartem

Oops, that was (note the past tense!) a longstanding bug. I mean a bug in the tool itself, not the syntax converter, which does have other bugs, a couple of which you reported above.

There was seemingly a longstanding problem with a production that only had a Java code block in it, like:

Foo: {some java code} ;

That would cause the exception you reported. This is because it actually parses a BNF production in a slightly funky way, which is that it treats the initial code block (assuming there is one) sort of specially. I mean it could (and probably should) just parse something like:

  Foo : {initial java code} Bar Baz ;

as a single expansion sequence that starts with the initial java code block, since a code block is itself an expansion, i.e.

  Foo production
     nested expansion
         code block
         nonterminal Bar
         nonterminal Baz

But it actually parses it as:

  Foo production
      initial code block
      nested expansion
           nonterminal Bar
           nonterminal Baz

This is for reasons that are mostly historical, I guess, in terms of the evolution from legacy JavaCC. It would actually be preferable for it to parse it the first way. I actually tried that but ran into some gnarly little problems. In fact, I should really try to get it working the other way in CongoCC.

Well, that's already a lot of detail, but the point is that the bug was this: if the only thing the production contained was a single code block, it parsed that as the initial code block, then expected a nested expansion and didn't find one. That is what the exception message you were getting amounted to. That whole rigmarole about "Was expecting one of the following" is basically the first set of an Expansion.

Well, currently, it parses a production with a single code block as:

       Foo production
           nested expansion
                 code block

So that should work, I guess. Of course, this got me thinking about why one would ever want to have a production that contains only a sole java code block. (Why not just write it as a Java method and invoke the method normally in code?) Well, there may be some reasons to do that. You could have:

     Foo#(true) : {some java code} ;

Then it always creates a Foo node (with no children) as a result of hitting the production. So it could make some sense in terms of tree building. There might be some other reasons too. In any case, even if it doesn't make much sense to do it, it should be legal. Why shouldn't you be able to do it? So that was a bug, I guess.
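
The tree shapes and the bug described above can be mimicked with a small Java toy. This is only a sketch under loud assumptions: the class, the Node type, and both parse methods are invented for illustration and have nothing to do with the actual parser code, and a production body is reduced to a list of strings where anything starting with "{" stands for a Java code block.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names here are hypothetical.
class ProductionShapeSketch {

    // A minimal tree node: a label plus ordered children.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) { this.label = label; }

        Node add(Node child) { children.add(child); return this; }

        @Override
        public String toString() {
            return children.isEmpty() ? label : label + children;
        }
    }

    // Old behaviour as described: a leading code block is split off as the
    // "initial code block", and a nested expansion must follow it.
    static Node parseOld(String name, List<String> items) {
        Node production = new Node(name);
        int start = 0;
        if (!items.isEmpty() && items.get(0).startsWith("{")) {
            production.add(new Node("initial code block"));
            start = 1;
        }
        if (start == items.size()) {
            // Corresponds to hitting ";" while expecting the start of an expansion.
            throw new IllegalStateException("expected an expansion but found ';'");
        }
        Node expansion = new Node("nested expansion");
        for (String item : items.subList(start, items.size())) {
            expansion.add(new Node(item.startsWith("{") ? "code block" : "nonterminal " + item));
        }
        return production.add(expansion);
    }

    // Fixed behaviour as described: a sole code block becomes the nested expansion.
    static Node parseFixed(String name, List<String> items) {
        if (items.size() == 1 && items.get(0).startsWith("{")) {
            return new Node(name).add(new Node("nested expansion").add(new Node("code block")));
        }
        return parseOld(name, items);
    }
}
```

In this toy, parseOld("Foo", List.of("{some java code}")) throws, which is the analogue of the reported exception, while parseFixed returns a Foo node holding a nested expansion with a single code block.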

adMartem

For my part, the converter did its thing, I fixed the glitches manually, and now I'm happy that the grammar has no traces of the old ways left over. So I certainly endorse your moving on. I just wanted to document the things I noticed. Thanks for doing this, it probably saved me several days at some point to find and change the remaining archaic expressions.