Custom DSL

palexdev

Hello, I need to develop a simple custom DSL that is a hybrid between Java/Groovy which would allow me to define a GUI as a tree structure and deserialize it by reading the text file.
I started with ANTLRv4 and I got a working system, but the more I use it, the more I don't like it (reasons).
Basically, here's what I'd like to achieve:

// Identifier for the node, simple or fully qualified. Optional constructor with ::(). Brackets for content
foo.bar.Node::(args) {
  .metadata: val // Val depends on metadata type
  property: val // Val can be anything, a node, a method call, a field, boolean, string, etc...
  foo.bar.MyClass.method1().method2() // A chain of method calls starting from a static method
  localMethod1().localMethod2() // A chain of method calls starting from one in foo.bar.Node

  // Children
  ChildNode1{}
  foo.bar.ChildNode2 {}
}

Just now I was considering migrating to JavaCC, but the plugin seems to be broken and unmaintained.
Then I discovered this, which is the successor of JavaCC, right?
Is there a plugin/editor/IDE to support the development?
Do you think I could achieve the above example?

revusky

palexdev Just now I was considering migrating to JavaCC, but the plugin seems to be broken and unmaintained.

Well, it's not just the plugin that is broken and unmaintained. The legacy JavaCC tool itself is broken and unmaintained.

Anyway, why don't we do this? Go through the example here: https://github.com/congo-cc/congo-parser-generator/tree/main/examples/arithmetic

Make sure you understand the (fairly simple) example thoroughly. By all means ask any questions here.

But, in any case, make sure you grasp this example thoroughly and we can go from there.

palexdev

So, I started converting from ANTLR to CongoCC and I did not get any result

// Configuration
BASE_SRC_DIR="../../";
PARSER_PACKAGE="congocc.test.generated";

//================================================================================
// PARSER
//================================================================================

Document : // Entry point
  UINode!
  <EOF>!;

UINode :
  <Identifier> <LBRACE> (Property)* <RBRACE>;

Property :
  <Identifier> <COLON> Type;

Type :
  <STRING>;

//================================================================================
// LEXER
//================================================================================

// Skip whitespaces and comments
SKIP : <Whitespace : (" "| "\t" | "\n" | "\r")+>;

UNPARSED #Comment :
  <COMMENT : "#" (~["\n"])*>;

// Symbols
TOKEN #Symbol :
    <EQUALS      : '='>
  | <PLUS_EQUALS : '+='>
  | <LPAREN      : '('>
  | <RPAREN      : ')'>
  | <LBRACK      : '['>
  | <RBRACK      : ']'>
  | <LBRACE      : '{'>
  | <RBRACE      : '}'>
  | <COLON       : ':'>
  | <COMMA       : ','>
  | <DOT         : '.'>;

TOKEN #Keyword :
    <THIS     : 'this'>
  | <BOOLEAN  : ('true' | 'false')>
  | <INFINITY : ('Infinity' | '-Infinity')>
  | <NAN      : 'NaN'>
  | <NULL     : 'null'>;

// Types
TOKEN #Types :
  // Literals
    <CHAR    : "'" (~["'", "\\", "\n", "\r"] | '\\') "'">
  | <STRING  : ("'" (~["'", "\\", "\n", "\r"] | '\\')* "'" | "\"" (~["\"", "\\", "\n", "\r"] | '\\')* "\"")>
  // Integers
  | <INTEGER : ["0"-"9"]((["0"-"9","_"])*["0"-"9"])?>
  | <LONG    : <INTEGER> (['l', 'L'])>
  | <HEX     : '0'['x', 'X'] (["0"-"9","a"-"f","A"-"F"])+>
  | <BINARY  : '0'['b', 'B'] (['0', '1'])+>
  | <OCTAL   : '0' (['0'-'7'])+>
  // Floating point
  | <FLOAT   : (<INTEGER>)? '.' <INTEGER> (['e', 'E'] ['+', '-'] <INTEGER>)? ['f', 'F']>
  | <DOUBLE  : (<INTEGER>)? '.' <INTEGER> (['e', 'E'] ['+', '-'] <INTEGER>)? (['d', 'D'])?>;

// Identifiers
TOKEN : <Identifier : ["a"-"z", "A"-"Z", "$", "_"] (["a"-"z", "A"-"Z", "$", "_"] | ((["+", "-"])?["0"-"9"]))*>;

Aside from the lexer rules, which I know for sure can be optimized or written differently, the parser returns null for rootNode().
For sure, the parser grammar is wrong. Point is, the lack of proper documentation makes this tool even harder to use than competitors. CongoCC makes the bold assumption that a user already knows some of this stuff, but I'm a complete neophyte.
Also, the visitor generated by ANTLR is much more intuitive to use, it all feels more 'automatic'

palexdev

revusky
Thanks, I'll check it asap.

Sorry for my little rant. It's getting really frustrating to learn these tools, it feels like I'm trying to accomplish something so big, yet conceptually seems so simple 😅

palexdev

I got something 🥳

Document : UINode <EOF>;

UINode :
  <FQN> <LBRACE>
  (Property)*
  <RBRACE>
;

Property :
  <Identifier> <COLON> <STRING>
;

// The tokens are the same as before but I added this for convenience
TOKEN : <FQN : <Identifier> (<DOT> <Identifier>)*>;

String testString = """
    foo.bar.MyNode {
      property: 'AString'
    }
    """;
UIParser parser = new UIParser(testString);
parser.Document();
Node root = parser.rootNode();
root.dump();

What would be the process now to traverse the tree?

In ANTLR a visitor would offer methods such as visitDocument(), visitNode(), visitProperty(), etc...
From this I build my model, which is then fed to a bunch of other helpers/handlers which instantiate and initialize the nodes with reflection.

Do I have to use the parser object and "manually" traverse it?

revusky

palexdev What would be the process now to traverse the tree?

Well, why not try something like this:

      class MyVisitor extends Node.Visitor {
          // Called for every Token in the tree; other node types are just recursed into
          void visit(Token node) {
              System.out.println("Visiting token " + node + " of type " + node.getType() + " at " + node.getLocation());
          }
      }

And elsewhere: new MyVisitor().visit(parser.rootNode());

And see what happens.

palexdev

I have the first ambiguity when I define methods as follows:

Method :
  <IDENTIFIER> <LPAREN> Args <RPAREN>
;

Methods :
  <FQN>
  (<DOT> Method)+
;

Testing against this string:

String methods = """
    foo.bar.MyNode {
      foo.bar.Class.call()
      localCall()
    }
    """;

When it reaches the ( character, it says Was expecting one of the following: DOT. Found string "(" of type LPAREN
It clearly is interpreting call as part of the FQN rule, but since there are the parentheses it should fall under the Method rule. Is there a way to tell the parser? (SCAN directive?)

palexdev

revusky
I did this:

public class TestVisitor extends Node.Visitor {
    @Override
    public void visit(Node node) {
        System.out.println("Visiting token " + node + " of type " + node.getType() + " at " + node.getLocation());
        recurse(node);
    }
}

So, this is the way of traversing the tree, good. The only confusing part is output like this:

Visiting token property: 'c' of type null at input:2:3
Visiting token property of type IDENTIFIER at input:2:3
Visiting token : of type COLON at input:2:11
Visiting token 'c' of type CHAR at input:2:13

It visits the property's value first and returns null as the type, weird. Then it re-visits it but with the correct type, in this case CHAR

revusky
Can you help me understand how to fix ambiguities such as the one posted above, regarding methods?
That's the major issue as of now

revusky

palexdev public class TestVisitor extends Node.Visitor {
    @Override
    public void visit(Node node) {
        System.out.println("Visiting token " + node + " of type " + node.getType() + " at " + node.getLocation());
        recurse(node);
    }
}

Hmm, well, you typically shouldn't be redefining visit(Node node). You should be defining visit methods for the Node subtypes that you are interested in.

In the example I gave you, it was visit(Token node) and so the machinery would traverse the tree and call your handler if the node is a Token, right? The trick is that if you haven't defined a visit method for a node's type, it just recurses into the node's children.

palexdev Can you help me understand how to fix ambiguities such as the one posted above, regarding methods?

To be honest, I'm not sure what "ambiguities" you are referring to. Frankly, I doubt very much there is any genuine ambiguity. I mean, okay, it's not doing what you think it should, which is common enough, but that is not an ambiguity! Like, maybe it's not at the point in the grammar that you think it is... Something like that..

palexdev

revusky Hmm, well, you typically shouldn't be redefining visit(Node node). You should be defining visit methods for the Node subtypes that you are interested in.

In the example I gave you, it was visit(Token node) and so the machinery would traverse the tree and call your handler if the node is a Token, right? The trick is that if you haven't defined a visit method for a node's type, it just recurses into the node's children.

Ooh, understood. The methods are called automatically with reflection, right?

revusky To be honest, I'm not sure what "ambiguities" you are referring to. Frankly, I doubt very much there is any genuine ambiguity. I mean, okay, it's not doing what you think it should, which is common enough, but that is not an ambiguity! Like, maybe it's not at the point in the grammar that you think it is... Something like that..

Unfortunately there is. If I write a chain of methods like this:
foo.bar.BazClass.call()
And my rule is written like this:

Methods :
  <FQN> // Fully qualified name, shorthand for <IDENTIFIER> (<DOT> <IDENTIFIER>)*
  (<DOT> Method)+
;

Method :
  <IDENTIFIER> <LPAREN> Args <RPAREN>
;

The FQN rule ends up consuming the whole foo.bar.BazClass.call string, but call should be part of Method because it is followed by the parentheses.

Should I rewrite the grammar, use a SCAN,...?

revusky

palexdev Ooh, understood. The methods are called automatically with reflection, right?

Yes. See https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/Node.java.ftl#L844-L933

Note also that if there is more than one visit method that could, in principle, handle a node, it takes the more specific handler. So, if you have Delimiter that subclasses Token, then if there is both:

     void visit(Token t) {...}

and:

      void visit(Delimiter d) {...}

it will call the second method on a Delimiter. But if that handler was not present it would just call the one for Token. So it's actually a much more powerful, elegant disposition than the one you are used to, I think.
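To make the "most specific handler wins" idea concrete, here is a rough, self-contained sketch of that kind of reflective dispatch. Everything in it (DispatchSketch, dispatch, run, the toy Node/Token/Delimiter classes) is hypothetical and written only for illustration; the real machinery is in the Node.java.ftl template linked above and may well differ in the details.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only; this is not CongoCC's generated code.
class DispatchSketch {
    static class Node {
        final List<Node> children = new ArrayList<>();
    }
    static class Token extends Node { }
    static class Delimiter extends Token { }

    static abstract class Visitor {
        final List<String> log = new ArrayList<>();

        final void dispatch(Node node) {
            // Walk from the node's runtime class up towards Node, so that the
            // most specific visit(X) handler declared by the visitor is found first.
            for (Class<?> type = node.getClass(); type != Node.class; type = type.getSuperclass()) {
                try {
                    Method handler = getClass().getDeclaredMethod("visit", type);
                    handler.setAccessible(true);
                    handler.invoke(this, node);
                    return;
                } catch (NoSuchMethodException e) {
                    // No handler for this exact type; try the superclass next.
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
            // No handler at all for this node type: just recurse into the children.
            for (Node child : node.children) {
                dispatch(child);
            }
        }
    }

    static class TokenOnly extends Visitor {
        void visit(Token t) { log.add("visit(Token) got " + t.getClass().getSimpleName()); }
    }

    static class TokenAndDelimiter extends Visitor {
        void visit(Token t) { log.add("visit(Token) got " + t.getClass().getSimpleName()); }
        void visit(Delimiter d) { log.add("visit(Delimiter) got " + d.getClass().getSimpleName()); }
    }

    // Builds a tiny tree (a plain Node with a Token and a Delimiter child) and visits it.
    static List<String> run(Visitor visitor) {
        Node root = new Node();
        root.children.add(new Token());
        root.children.add(new Delimiter());
        visitor.dispatch(root);
        return visitor.log;
    }
}
```

With TokenOnly, both children end up in visit(Token). With TokenAndDelimiter, the Delimiter child goes to visit(Delimiter) instead, and the root (which has no handler) is simply recursed into in both cases.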

palexdev The FQN rule ends up consuming the whole foo.bar.BazClass.call string, but call should be part of Method because it is followed by the parentheses.

Oh, I see... I guess you're a bit confused about the relationship between the lexical and syntactic parts of the grammar. Let's see... in principle, these are two separate machines that don't really know anything about each other. But the lexical machine is lower level. That's what is operating first. So it is partitioning the input into tokens, that's what the "lexer" does, and the "parser" consumes those tokens and matches them to the grammar rules (or productions...). But if the lexer is already breaking up the input in that way and then the parser sees the stream of tokens that it sees... well, hopefully you see my point. Actually, the confusion between what the lexer and the parser do is probably about the most basic newbie conceptual confusion in terms of using this sort of tool. Once you clear that up, you'll have come a significant way.

So, getting back to your issue, once you've defined a lexical rule that is <IDENTIFIER>(<DOT><IDENTIFIER>)* that rule is going to gobble as much input as it can (that's called "greedy matching"). So your foo.bar.BazClass.call is going to be a single FQN token and that is what your parser will see.
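A small standalone sketch may help to see the greedy matching in action. It uses java.util.regex as a stand-in for the lexer, so it only approximates what a generated lexer does; the class name GreedyLexSketch and the TYPE:text output format are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a lexer; not generated by CongoCC.
class GreedyLexSketch {
    private static final String ID = "[A-Za-z_$][A-Za-z0-9_$]*";

    // FQN is a single lexical rule: an identifier followed by any number of ".identifier" parts.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<FQN>" + ID + "(?:\\." + ID + ")*)|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<DOT>\\.)");

    private static final String[] TYPES = {"FQN", "LPAREN", "RPAREN", "DOT"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Match exactly at the current position, taking as much input as the rule allows.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(type + ":" + m.group(type));
                }
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

Calling GreedyLexSketch.tokenize("foo.bar.BazClass.call()") returns [FQN:foo.bar.BazClass.call, LPAREN:(, RPAREN:)]. There is no separate token for call, so a parser rule that expects DOT followed by an identifier before the parenthesis never gets the chance to match.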

Almost certainly what you need is to handle this on the syntactic level, not the lexical level. Of course, the other problem with the way you have done this with the FQN token is that you would probably need the thing to handle whitespace (and comments too) because, on occasion, somebody would want to write:

                foo
                .bar
                .bazClass
                .call(...)

Typically whitespace and comments would be possible. You could maybe specify this within a lexical rule, but it gets very very messy, I would say, and the normal way to handle this would be at the syntactic level, not the lexical level.

But aside from all that, maybe you would do well to make a careful study of the existing grammars under the examples/ directory. For example, consider the QualifiedIdentifier production in the C# grammar here: https://github.com/congo-cc/congo-parser-generator/blob/main/examples/csharp/CSharp.ccc#L73-L75

and you could look at the spots where it is used. Or the DottedName production in the Python grammar here: https://github.com/congo-cc/congo-parser-generator/blob/main/examples/python/Python.ccc#L195

Well, for example, if we look at the DottedName production from the Python grammar, we see:

          DottedName : <NAME> (<DOT> <NAME> =>||)* ;

The up-to-here =>|| means, by the way, that it actually scans ahead to this point when deciding whether to consume another iteration of the loop, i.e. it checks whether the next token is a <DOT> and the one after that is a <NAME>, right? The equivalent C# production does not check that in this spot, subtle difference.

But, let's suppose we also don't want to tack on the <DOT><NAME> if the token after <NAME> is an opening parenthesis, like for a method call. We could write that as:

        DottedName : <NAME> (<DOT> <NAME> ENSURE ~(<LPAREN>) =>||)* ;

So we need a <DOT> and a <NAME> and also that the next token is not an opening parenthesis.

So with that production, we could write:

        MethodCall : DottedName <DOT> <NAME> Args ;

and that might be more on track to doing what you want. But you see, above, we write the DottedName production so that in foo.bar.BazClass.call(...) it stops gobbling input after BazClass because the next three tokens are <DOT><NAME><LPAREN>.
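For intuition, the loop condition of that DottedName production can be written out by hand roughly as follows. This is only a sketch over a list of token types, with hypothetical names (LookaheadSketch, dottedNameEnd); the code the tool actually generates will look different.

```java
import java.util.List;

// Hand-written approximation of the lookahead described above; names are hypothetical.
class LookaheadSketch {
    enum Type { NAME, DOT, LPAREN, RPAREN, EOF }

    // Looks at the token at 'index' without consuming it; past the end counts as EOF.
    private static Type peek(List<Type> tokens, int index) {
        return index < tokens.size() ? tokens.get(index) : Type.EOF;
    }

    // Returns the index just past the DottedName that starts at 'start'.
    static int dottedNameEnd(List<Type> tokens, int start) {
        if (peek(tokens, start) != Type.NAME) {
            throw new IllegalArgumentException("Expected NAME at index " + start);
        }
        int pos = start + 1;
        // Iterate only if the next three tokens are DOT, NAME, and then anything but LPAREN.
        while (peek(tokens, pos) == Type.DOT
                && peek(tokens, pos + 1) == Type.NAME
                && peek(tokens, pos + 2) != Type.LPAREN) {
            pos += 2;
        }
        return pos;
    }
}
```

For the token sequence of foo.bar.BazClass.call() (NAME DOT NAME DOT NAME DOT NAME LPAREN RPAREN), dottedNameEnd(tokens, 0) returns 5, i.e. the name stops after BazClass and leaves DOT NAME LPAREN for the method call.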

Anyway, of course, I don't really know exactly what you want to do, but the above could be food for thought and maybe put you on the right track.

palexdev

revusky
I'm definitely starting to understand a bit more of these tools.
I've rewritten my rules as follows:

Method :
  <IDENTIFIER> <LPAREN> Args <RPAREN>
;

Methods :
  // I added 'this' to explicitly distinguish between static and instance methods
  (<THIS> | FQN) (<DOT> Method)+
;

FQN: <IDENTIFIER> (<DOT> <IDENTIFIER> ENSURE ~(<LPAREN>) =>||)*;

So I read this as follows, correct me if I'm wrong.
The parser starts consuming the FQN rule. The first IDENTIFIER is always consumed. Then it expects a series of <DOT> <IDENTIFIER> tokens, but before consuming them, it checks that no LPAREN is present afterward.
If that's the case, it rewinds the stack (?) and simply exits.

Can you please take a look at this too?
I'm now trying to add this new rule:

Factory :
  ('.builder' | '.factory') <COLON> Methods
;

If I test it against this string:

foo.bar.builder.Node {
  .factory: AFactory.build()
}

I get an error because .builder inside foo.bar.builder.Node is recognized as a whole. If I understood your earlier explanation, it's the lexer that is interpreting it as part of the factory rule, right?
Would I be able with a lookahead to tell it: 'make sure there is a colon afterwards, otherwise it's not the factory rule but something else'?

revusky

palexdev The parser starts consuming the FQN rule. The first IDENTIFIER is always consumed. Then it expects a series of <DOT> <IDENTIFIER> tokens, but before consuming them, it checks that no LPAREN is present afterward.

Well, yeah, it might be a bit more precise to say that it generates a lookahead routine such that it only iterates the loop if the next three tokens are DOT, IDENTIFIER, and then something other than LPAREN.

palexdev If that's the case, it rewinds the stack (?) and simply exits.

Well, once the lookahead routine is NOT satisfied, then it just exits the production. The FQN method exits. You can look at the generated code and see what it is doing maybe.

palexdev I get an error because .builder inside foo.bar.builder.Node is recognized as a whole.

Yes, if you define the string ".builder" as a token then it is going to match that rather than DOT followed by IDENTIFIER. So it's not going to work, no.

Probably, again, you want to define the .builder and .factory constructs on the syntactic side, not the lexical side.

On the other hand, there is the possibility of turning off the .builder and .factory tokens to get the result you want. So you could have:

  FQN:
       DEACTIVATE_TOKENS DOT_BUILDER, DOT_FACTORY (
            <IDENTIFIER> (<DOT> <IDENTIFIER> ENSURE ~(<LPAREN>) =>||)*
       );

And that should work.
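As a toy model of what turning those tokens off changes, the sketch below tokenizes the same input with and without the two literal tokens active. It is a hand-rolled tokenizer, not the generated lexer, and the names LiteralTokenSketch and tokenize are hypothetical; DOT_BUILDER and DOT_FACTORY are assumed to be the names given to the '.builder' and '.factory' literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy tokenizer for illustration; not the lexer that CongoCC generates.
class LiteralTokenSketch {
    static List<String> tokenize(String input, Set<String> deactivated) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (!deactivated.contains("DOT_BUILDER") && input.startsWith(".builder", pos)) {
                // The longer literal wins over a plain DOT while it is active.
                tokens.add("DOT_BUILDER");
                pos += ".builder".length();
            } else if (!deactivated.contains("DOT_FACTORY") && input.startsWith(".factory", pos)) {
                tokens.add("DOT_FACTORY");
                pos += ".factory".length();
            } else if (c == '.') {
                tokens.add("DOT");
                pos++;
            } else if (Character.isJavaIdentifierStart(c)) {
                int end = pos + 1;
                while (end < input.length() && Character.isJavaIdentifierPart(input.charAt(end))) {
                    end++;
                }
                tokens.add("IDENTIFIER(" + input.substring(pos, end) + ")");
                pos = end;
            } else {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }
}
```

With nothing deactivated, foo.bar.builder.Node comes out as IDENTIFIER(foo), DOT, IDENTIFIER(bar), DOT_BUILDER, DOT, IDENTIFIER(Node), which an FQN-style rule cannot match. With both literal tokens deactivated, the same input becomes a plain alternation of IDENTIFIER and DOT tokens.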

palexdev

revusky Probably, again, you want to define the .builder and .factory constructs on the syntactic side, not the lexical side.

Can you please clarify what you mean by 'on the syntactic side'?
Are you suggesting changing it to something less ambiguous for the parser, like '@factory' for example?

I also tried this:

Factory : <FactoryMetadata> Methods;

TOKEN : <FactoryMetadata : ('.builder' | '.factory') <COLON>>;

And it works, but I guess it's just because I also added the COLON to the TOKEN, which would basically be the same as doing ('.builder:' | '.factory:') Methods (?).

revusky

palexdev Can you please clarify what do you mean by 'on the syntactic side'?

Not lexically, i.e. not as a single token.