New Feature: Contextual (a.k.a. soft) Keywords

revusky

adMartem
The soft keywords feature is working in "master" now. You can (optionally) declare them like other tokens, as in:

      CONTEXTUAL #Keyword : 
           <YIELD : "yield">
           |
          <RECORD : "record">
      ;

Etc. You can declare them as case-insensitive, say, even if your grammar overall is case-sensitive. In the above, the #Keyword annotation means that the tokens are instantiated as instances of that class.

And you can also just "declare" them on the spot, like here or here or here.

And it's working for Python and C# as well! It all seems fine. There may be some glitches here and there. Probably there are.

As I said, the "soft" keyword has to be matched as some more general pattern or it just doesn't work, which is why non-sealed could not be replaced with a contextual keyword, since the Java tokenizer will match that as "non" followed by "-" followed by "sealed". But if you do have a more general pattern in your grammar that would match that, then it should work fine.
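To make that constraint concrete, here is a minimal sketch in Java. It is not CongoCC's generated code: the class `SoftKeywordSketch`, its `Kind` enum and both methods are hypothetical names invented for illustration. The lexer only ever produces the general pattern (an identifier), and the parser re-types that token where the grammar expects the keyword.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from CongoCC.
class SoftKeywordSketch {
    enum Kind { IDENTIFIER, YIELD, RECORD, OTHER }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    private static final String IDENT = "[A-Za-z_][A-Za-z0-9_]*";
    // An identifier, or else any single non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile(IDENT + "|\\S");

    // Step 1: the lexer knows nothing about soft keywords. It only matches
    // the more general pattern, so "record" comes out as an IDENTIFIER.
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String image = m.group();
            Kind kind = image.matches(IDENT) ? Kind.IDENTIFIER : Kind.OTHER;
            tokens.add(new Token(kind, image));
        }
        return tokens;
    }

    // Step 2: at a point where the grammar expects the keyword, the parser
    // re-types a matching identifier. Any other token is returned unchanged.
    static Token asSoftKeyword(Token t, Kind wanted) {
        String literal = wanted.name().toLowerCase(Locale.ROOT);
        if (t.kind == Kind.IDENTIFIER && t.image.equals(literal)) {
            return new Token(wanted, t.image);
        }
        return t;
    }
}
```

With this toy lexer, `lex("non-sealed")` yields three tokens (`non`, `-`, `sealed`), so there is no single token whose image is `non-sealed` that could be re-typed. That is the same reason given above for why `non-sealed` cannot be a contextual keyword.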

I think I'm going to delete the "slave branch" pretty soon. You don't have any objection, do you?

adMartem

revusky No objection here. Can't wait to use this new feature. It will be much more scrutable than the way I do this now (in my grammars). Thanks!

adMartem

Would you like me to try to synchronize C# and Python with the cardinality changes now, or should I wait? I don't think there is anything in them that my inexperience with Python or C# would render insurmountable, but I was waiting for a period of quiescence before attempting it (they might take me a while). It mostly affects the ParserProductions and LookaheadRoutines templates, as you doubtless know.

BTW, unless you object, I would like to make a change to generate a TokenType as MY_TOKEN("my_token", false, true) in the case of a non-contextual but defined as [IGNORE_CASE]. It is needed if using the TokenType to drive IDE syntax coloring, and was a change, more or less, that I had planned on making before you implemented a superset of it. As here.

revusky

adMartem
Well, I guess this would be a good time to do it. I'm probably going to take a bit of a break on this. I really should. I took a bike ride yesterday for the first time in a month or so and I was just so unbelievably tired afterwards. I need to get back in shape.

By the way, have you noticed the recent changes in error reporting? That is another area where the Python and C# versions are not up to date. I was thinking (or hoping) that Vinay would take a look at that. But if you want to, it would be quite welcome.

We really need to do something to get the word out, in particular to the Python and Dotnet communities. What we have does not work so badly, really. A bit more of a push with the raw code injections and we could get them in sync feature-wise. There is the issue that the parsers are MUCH slower. I expected that for Python, but it is surprising how much slower the CSharp parsers are. It would help if we could get somebody involved who is interested in profiling and figuring out how to improve things.

revusky

adMartem BTW, unless you object, I would like to make a change to generate a TokenType as MY_TOKEN("my_token", false, true) in the case of a non-contextual but defined as [IGNORE_CASE]. It is needed if using the TokenType to drive IDE syntax coloring, and was a change, more or less, that I had planned on making before you implemented a superset of it. As here.

Oh, regarding this... well, all of this is in a pretty fluid state so if you have some ideas for doing it a bit differently, it's probably all okay. In the other languages, it turned out that I had to do things quite differently because the enums don't have the same functionality. In C#, I guess they are more like syntactic sugar around integer constants, while in Java, the enums really are objects and you can give them fields and constructors and have them implement interfaces... And, of course, EnumSet and EnumMap are very efficient implementations of Set and Map where the enum objects act as the keys. It really is something that the Java people did pretty well IMHO.
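As a rough sketch of what that buys you on the Java side (not the actual generated `TokenType`): the enum below has a three-argument constructor shaped like the `MY_TOKEN("my_token", false, true)` example. The field names, and the reading of the two booleans as "contextual" and "ignore case", are assumptions made for illustration. `SketchTokenType` and `ColoringSketch` are hypothetical names.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Hypothetical sketch. The meaning of the constructor arguments is assumed.
enum SketchTokenType {
    IDENTIFIER(null, false, false),
    YIELD("yield", true, false),        // contextual, case-sensitive
    MY_TOKEN("my_token", false, true);  // non-contextual, ignores case

    private final String literal;
    private final boolean contextual;
    private final boolean ignoreCase;

    SketchTokenType(String literal, boolean contextual, boolean ignoreCase) {
        this.literal = literal;
        this.contextual = contextual;
        this.ignoreCase = ignoreCase;
    }

    boolean hasLiteral() { return literal != null; }
    boolean isContextual() { return contextual; }

    // True if the given text is this token's literal, honoring ignoreCase.
    boolean matches(String image) {
        if (literal == null) return false;
        return ignoreCase ? literal.equalsIgnoreCase(image) : literal.equals(image);
    }
}

class ColoringSketch {
    // EnumSet stores the members compactly, keyed by ordinal.
    static EnumSet<SketchTokenType> highlightable() {
        EnumSet<SketchTokenType> set = EnumSet.noneOf(SketchTokenType.class);
        for (SketchTokenType t : SketchTokenType.values()) {
            if (t.hasLiteral()) set.add(t);
        }
        return set;
    }

    // EnumMap is array-backed; here it maps token types to a style name.
    static Map<SketchTokenType, String> styles() {
        Map<SketchTokenType, String> styles = new EnumMap<>(SketchTokenType.class);
        for (SketchTokenType t : highlightable()) {
            styles.put(t, t.isContextual() ? "soft-keyword" : "keyword");
        }
        return styles;
    }
}
```

An IDE coloring layer could consult something like `styles()` without needing a parser instance, which is the kind of use that needs the case-insensitivity flag on the token type itself.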

vsajip

revusky I was thinking (or hoping) that Vinay would take a look at that

I would have liked to, but I have family commitments which will take up a lot of my time until around mid-September. Happy for John to have a crack at it, and I can provide any pointers if at all needed. When I have time I might be able to look at the C# profiling, too.

adMartem

vsajip
Thanks Vinay. Right now I will try and get the cardinality changes made for Python and C# so they don't drift too far apart. I'll definitely appreciate pointers when I get stuck (as I likely will).

revusky

vsajip
Hi Vinay. Nice to hear from you. I understand that you just don't have the free time. But... if you have just a moment to put into this, could you get the build of the C# parser in C# working? It must be a trivial issue. Maybe even just a few minutes. Like if you run:

dotnet build cs-csharpparser/org.parsers.csharp.csproj

from the examples/csharp directory, you get a message like this:

/home/revusky/projects/congo/examples/csharp/cs-csharpparser/Lexer.cs(12,46): error CS0234: The type or namespace name 'ppline' does not exist in the namespace 'org.parsers.csharp' (are you missing an assembly reference?) [/home/revusky/projects/congo/examples/csharp/cs-csharpparser/org.parsers.csharp.csproj::TargetFramework=net6.0]
/home/revusky/projects/congo/examples/csharp/cs-csharpparser/Lexer.cs(11,54): error CS0234: The type or namespace name 'ppline' does not exist in the namespace 'org.parsers.csharp' (are you missing an assembly reference?) [/home/revusky/projects/congo/examples/csharp/cs-csharpparser/org.parsers.csharp.csproj::TargetFramework=net6.0]

The problem is obviously that the CSharp parser is using this other little parser ppline that it cannot find. What does one have to do to get this working? I figure it's just a minute for you. And this says a lot about my total ignorance of dotnet!

I do hope you can find a moment to get this working. In any case, I guess we'll be back in touch in mid-September! I think you'll find there will be a lot of exciting new stuff by that point.

vsajip

revusky I figure it's just a minute for you.

Perhaps it's not quite that simple. I had a very quick look, and at the top of the generated file Parser.cs for the line directive parser, I find:

using org.parsers.csharp.ppline;
using System;
using System.IO;
if (args.Length == 0) {
    Console.WriteLine("Usage: <program> " + "files or directories with files to parse.");
}
long start = System.DateTime.Now.Ticks;
int successes = 0;
int failures = 0;
foreach (string arg in args) {
    if (arg.EndsWith(".default") && File.Exists(arg)) {
        try {
            Parser p = new Parser(arg);
            p.ParseModule();
            Console.WriteLine("parsed " + arg + " successfully.");
            successes++;
        }
        catch(ParseException e) {
            Console.WriteLine("Problem parsing file: " + arg);
            Console.WriteLine(e);
            failures++;
        }
    }
    else foreach (var f in Directory.EnumerateFiles(arg, "*.default", SearchOption.AllDirectories)) {
        try {
            Parser p = new Parser(f);
            p.ParseModule();
            Console.WriteLine("parsed " + f + " successfully.");
            successes++;
        }
        catch(ParseException e) {
            Console.WriteLine("Problem parsing file: " + f);
            Console.WriteLine(e.ToString());
            failures++;
        }
    }
}
Console.WriteLine("Successfully parsed " + successes + " files.");
Console.WriteLine("Failed on: " + failures + " files.");
long duration = (System.DateTime.Now.Ticks - start) / 10000;
Console.WriteLine("Duration: " + duration + " milliseconds.");
namespace org.parsers.csharp.ppline {
  // classes which I would expect to be there
}

I haven't looked into where that preamble (before the namespace declaration) comes from, but before your big refactoring, it wasn't there, and the Parser.cs file started with the namespace declaration. So it's possibly a transpilation issue, or something else, but it doesn't look as if the preprocessor parser will compile, and it is needed by the C# parser for conditional compilation etc.

adMartem

revusky
I wondered how you made the changes to Python for this, and now I see what you did. Probably those indicator methods should be somehow associated with the TokenType or at least in the Lexer so that the Parser is not needed to get to them. But then I know so little about Python that I wouldn't have anything useful to say at this point. But I agree, I've thought for many years that the Java Enum is one of their true achievements. Unlike the 16-bit architecture and tacked on generics with erasure.

revusky

adMartem I wondered how you made the changes to Python for this, and now I see what you did.

Well, I have no doubt that there's a better way of doing it. I was getting some severe tunnel vision and just really wanted to get it working and I saw a way I could get it working and went with that...

Probably those indicator methods should be somehow associated with the TokenType or at least in the Lexer so that the Parser is not needed to get to them.

I suppose you're right. It shouldn't be necessary to have an instance of the parser to check the info. So, by all means, feel free to refactor it, certainly if you see a need to.

Various things have occurred to me about ASSERT/ENSURE and fault-tolerant and so on and I was meaning to write them up today, but I guess I'll do it tomorrow.

revusky

vsajip I haven't looked into where that preamble (before the namespace declaration) comes from, but before your big refactoring, it wasn't there, and the Parser.cs file started with the namespace declaration.

Well, yes, I added the preamble up top in the Parser.cs.ctl template but that is really just the equivalent (AFAICS) of having a main() method but with less ceremony. You can have some statements up top that effectively serve as a main, an entry point into the program.

I really don't think the presence of that preamble is what is causing the failure to build. In fact, I just deleted the preamble on top of ./cs-pplineparser/Parser.cs and its presence/absence does not seem to affect whether the thing builds. (It still doesn't.) I think the issue is that somehow cs-csharpparser needs to say something or other for the pplineparser part to be rolled into the "assembly", as they call it.

I mean, it's not surprising that it doesn't work. I don't see anywhere where we are indicating that cs-csharpparser uses pplineparser, so...

But I honestly don't know what is needed! There are some other C# issues that are also completely befuddling me, but one thing at a time. I'd love to get the csharp parser in csharp working just so that we can benchmark it etcetera. Also, we'd have all the major languages that we are supporting with parsers in Java, C#, and Python.

vsajip

revusky that is really just the equivalent (AFAICS) of having a main() method but with less ceremony. You can have some statements up top that effectively serve as a main, an entry point into the program.

This is not a C# idiom, and I don't think it will be considered professional quality code. The norm is to have a Main method, as in Java.

It looks like your refactoring is a bit simplistic: the C# preprocessor parser is assumed to have a ParseModule method, because of this bit:

#var rootProduction = "Module"
#if extension == "cs" || extension == "java"
   #set rootProduction = "CompilationUnit"
#elif extension = "lua"
   #set rootProduction = "Root"
#endif

It might be OK as a quick hack, but it's too wedded to our built-in grammars (and perhaps not wedded enough, as the C# preprocessor is built-in too). In practice a user could be using any name for the top-level production (if there is one). IMO that whole preamble needs to go into a Main method, and perhaps a configuration value should be added to indicate a top-level production in a way that the user can control.

As the old test harnesses that you ripped out built everything correctly, that logic is perhaps what's needed to be replicated in the current build procedure.

Here's the error message:
.../congocc/examples/csharp/cs-pplineparser/Parser.cs(16,15): error CS1061: 'Parser' does not contain a definition for 'ParseModule' and no accessible extension method 'ParseModule' accepting a first argument of type 'Parser' could be found (are you missing a using directive or an assembly reference?) [.../congocc/examples/csharp/cs-pplineparser/org.parsers.csharp.ppline.csproj::TargetFramework=net6.0]

revusky

Oh, and I added a comment about a separate issue that I raised here.

revusky

vsajip It might be OK as a quick hack,

Well, that's all it is is a quick hack. In fact, it's quite ghastly. For example, the way it uses the default lexical state in the grammar as a way of deducing the file extension (and the root production) is truly nasty! I would love for you to fix that up!

but it's too wedded to our built-in grammars (and perhaps not wedded enough, as the C# preprocessor is built-in too). In practice a user could be using any name for the top-level production (if there is one). IMO that whole preamble needs to go into a Main method, and perhaps a configuration value should be added to indicate a top-level production in a way that the user can control.

Well, sure. The entry production should be properly parametrized. What you see there now is just the result of me wanting to get the thing to work. So, by all means, feel free to fix it up.
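One way to picture a parametrized entry production, sketched in Java rather than in the C# template: the harness takes the production name as a value and looks up the matching method at run time. This is only an illustration under stated assumptions. `EntryProductionSketch`, `DemoParser` and `parseFrom` are hypothetical names, and nothing here says how a CongoCC setting for this would actually be spelled.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: the entry production is data, not hard-coded.
class EntryProductionSketch {

    // Stand-in for a generated parser with one public no-arg method
    // per grammar production.
    static class DemoParser {
        boolean parsed;
        public void CompilationUnit() { parsed = true; }
    }

    // Looks up the production method by name and invokes it.
    // Throws NoSuchMethodException if the parser has no such production.
    static void parseFrom(Object parser, String rootProduction)
            throws ReflectiveOperationException {
        Method entry = parser.getClass().getMethod(rootProduction);
        entry.invoke(parser);
    }
}
```

Calling `parseFrom(parser, "Module")` on the stand-in parser fails with `NoSuchMethodException`, the run-time analogue of the CS1061 error quoted earlier in the thread. A user-supplied value would avoid guessing the name from the file extension.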

On the Java side, what I think I'm going to do is consolidate all of these test harness programs, like JParse.java and CSParse.java and so on, and just have a single test harness that is a bit more generalized and can be used for all of them, and that can be in the congocc.jar file. I'm thinking that the ability to run over a directory of source files with whatever parser could be built into the jarfile, so you just could do:

       java -jar congocc.jar jparse <directory>

and get some instant gratification that way.

Or the other possibility is to roll these things up as custom ant tasks, which is certainly possible. There is also the ability to define macros. So, maybe just leveraging Ant in a more sophisticated manner could be the right idea. Or... we could look for another build tool.

But, anyway, all those test harness programs like JParse.java and CSParse.java etc. are basically the same thing obviously. They just run over a directory and get the files and feed them to the parser. All these copy-paste-modify versions of the same thing look a bit silly. And, of course, what I did with C# is just another hack.
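For what it's worth, the shared part of those harnesses is small. Here is a minimal sketch of a generalized one, assuming the parser is handed in as a callback. `ParseHarnessSketch`, `SourceParser` and `Result` are hypothetical names and not anything in congocc.jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of one harness shared by all the example parsers.
class ParseHarnessSketch {

    // Whatever parses a single file; it throws on a parse error.
    interface SourceParser {
        void parse(Path file) throws Exception;
    }

    static final class Result {
        final int successes;
        final int failures;
        Result(int successes, int failures) {
            this.successes = successes;
            this.failures = failures;
        }
    }

    // Walks root recursively (root may also be a single file), feeds every
    // regular file whose name ends with the extension to the parser, and
    // counts successes and failures.
    static Result run(Path root, String extension, SourceParser parser)
            throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(extension))
                        .sorted()
                        .collect(Collectors.toList());
        }
        int successes = 0;
        int failures = 0;
        for (Path file : files) {
            try {
                parser.parse(file);
                successes++;
            } catch (Exception e) {
                System.err.println("Problem parsing file: " + file);
                failures++;
            }
        }
        return new Result(successes, failures);
    }
}
```

Each language-specific harness would then reduce to one call that supplies the extension and a lambda wrapping that parser's entry point.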

So I have been thinking about how to organize things better. I think that ant, as retro as it is (you're supposed to use Maven or Gradle apparently, if you want to be one of the cool kids), surely has the needed machinery to organize things better. Really, so far I have been using it in an extremely bloody-minded brain-dead manner. So the idea would be just to leverage ant's feature set better: have some ant macros in some file like commonantmacros.xml that we import, and so on. The whole thing is gradually getting unwieldy. Though, that said, up to this point, I've never spent all that much time mucking with ant files. (And I don't really intend to start!)

And then, of course, the other possibility would be to find some other build tool. I was looking at this thing called Bazel. I guess it comes from Google. But I honestly don't know whether it's worth bothering with. Maybe Ant is not so bad. Not that I like ant so much, and its biggest downside is that it is horribly verbose because it's all in XML. Something more or less like ant with a more terse DSL, i.e. one that doesn't use XML, would maybe be just fine.

As the old test harnesses that you ripped out built everything correctly, that logic is perhaps what's needed to be replicated in the current build procedure.

Well, I don't want to get contentious over this. But the truth of the matter is that I just never got it working locally, in particular the C# tests you put in place. I guess it does work, since it did work with GitHub's CI, but I never managed to get the whole IronPython+Dotnet thing going locally and I never understood why. When I tried to run it locally, I just always got these messages that I couldn't make any sense of. I honestly don't know how many people ran across the project, tried to get it working, and failed and then gave up. (Surely a non-zero number.)

I guess I am not absolutely opposed to IronPython in principle, I mean having some examples that use it, but I would say that first we always needed clear examples of just running it all using the more conventional toolset. And the fact remains, I'm pretty sure that most people don't know what ipy is, so there would be a need to document the whole thing better for people. You'd need to write some README explaining what IronPython is and why it's appealing to use it as a dot-net scripting language that can script these generated parsers and so on. I don't think you quite understand how opaque the whole thing was for most people.

And the fact (and facts are stubborn things, as the adage says) remains that we just don't have any users (that I know of!) on the non-Java languages. It doesn't work too badly really, though not as well as in Java obviously, but we're trying to remedy that. But we need some tutorial material -- even on the Java side, but certainly on the C#/Python side. Things that could serve as articles somewhere, I think.

The whole thing with ANTLR has sort of galvanized me, I think. I just look at that thing and I think we've beaten the living crap out of those people technically -- at least in terms of having a usable, practical tool. Now, okay, if you need a parser in some language that we're not supporting, Javascript or whatever, then maybe one has to go with ANTLR, fine. But if you just wanted to do something in Java, CongoCC is just a vastly better tool. Actually, if you just wanted to do something in Java, and it was a choice between ANTLR and the legacy JavaCC, I think the practical choice could well be the old JavaCC. At least that, for all its limitations and glitches, tends to generate something reasonably performant and just somehow is not so strangely opaque as ANTLR. But, of course, why would anybody use the old JavaCC when CongoCC exists?