For consideration: a set-based variant of the expansion sequence

adMartem

In some parsers (COBOL being one example) it is sometimes necessary to model an "at most one of each" style language element. Here is an example from the COBOL standard syntax reference. In this case, the meaning is that any of the alternatives may be chosen, but each at most once. Historically, the way I (and I assume others) have dealt with this is by adding actions to a normal EBNF choice construct, or to a later visitor, to restrict each choice to occurring only once in the enclosing ZeroOrMore or OneOrMore element. This gets tedious if there are many of them and invites careless mistakes, not to mention that the intent becomes separated from the grammar, which is where you would normally prefer to look for this kind of syntactic restriction.
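For illustration, the kind of hand-written check described above might look roughly like the following Java sketch. The class and method names (OnceOnlyCheck, firstDuplicate) are hypothetical and not part of Congo, and the clauses are reduced to plain strings; a real action or visitor would work on tokens or nodes instead.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Congo API: the check an action or a later
// visitor would run over the clauses matched by a plain ( A | B | C )+ loop.
class OnceOnlyCheck {
    // Returns the index of the first clause that repeats an earlier one,
    // or -1 if every clause occurs at most once.
    static int firstDuplicate(List<String> clauses) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < clauses.size(); i++) {
            // Set.add returns false when the element was already present.
            if (!seen.add(clauses.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

The point is less the code itself than the fact that it has to be repeated, outside the grammar, for every such element.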
It occurs to me that implementing this behavior is fairly trivial given the way Congo builds the parser, if it is modeled as a form of ExpansionChoice where the choices are considered to be a sequential set of choices formed by the parent ExpansionWithParentheses, rather than just a sequence of choices as in the present implementation. Such a thing could be notated with something like ( "foo" || "bar" || "baz" )+ or maybe ( "foo", "bar", "baz" )+. It seems to me that the impact on lookahead would be minimal to none, since I think it would be identical to a normal choice, provided the semantics dictate that something like foo bar foo is correctly parsed by ( "foo" || "bar" || "baz" )+ "foo". In other words, the ExpansionChoice is entered when any of the alternatives is available and exits when the next token(s) cannot be added to the set.
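A minimal Java sketch of that entry/exit rule, assuming tokens are plain strings and the loop is greedy. SetChoice, consume, and matches are hypothetical names for illustration only; this is not how Congo generates code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed set semantics, not generated Congo code.
class SetChoice {
    // Consumes tokens from 'start' while each one is an alternative that has
    // not been used yet. Returns the index of the first token NOT consumed.
    static int consume(List<String> tokens, int start, Set<String> alternatives) {
        Set<String> used = new HashSet<>();
        int pos = start;
        // used.add(...) is only reached for valid alternatives and returns
        // false on a repeat, which ends the loop (the "exit" condition).
        while (pos < tokens.size()
                && alternatives.contains(tokens.get(pos))
                && used.add(tokens.get(pos))) {
            pos++;
        }
        return pos;
    }

    // Matches ( "foo" || "bar" || "baz" )+ "foo" against the whole token list.
    static boolean matches(List<String> tokens) {
        int end = consume(tokens, 0, Set.of("foo", "bar", "baz"));
        return end > 0                        // '+' needs at least one element
                && end == tokens.size() - 1   // exactly one token remains
                && tokens.get(end).equals("foo");
    }
}
```

With the input foo bar foo, the loop takes foo and bar, stops at the second foo because it is already in the set, and the trailing "foo" then matches it.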
Anyway, I thought I would throw it out for discussion if anyone else thinks it is potentially useful.

revusky

Well, in principle, I think it is useful. In this vein, by the way, you might (if you haven't already) look at how the Modifiers production in the Java grammar works. See here: https://github.com/congo-cc/congo-parser-generator/blob/main/examples/java/Java.ccc#LL155

That's actually a somewhat more complex case than just a set, because there are rules about which modifiers can appear together. For example, if you already have the modifier public somewhere, it can't occur again obviously, but also certain combinations of modifiers are not permissible, like you can't have public private and you can't have abstract final and so on.

But, the notion of a choice in which an option can occur at most one time, that does seem like something useful. And I guess it's not that hard to implement either.

There are some subtle issues, however, in terms of lookahead vs. parsing. Suppose somebody does write (in Java) something like:

    public static public void foo() {....}

Well, the second public is erroneous obviously, but it seems to me that, practically speaking, it is usually better if your lookahead is forgiving of the error, but then it is caught when you actually parse the construct. Or, in other words, your predicate for entering the MethodDeclaration production is deliberately looser than the actual parsing. The advantage of that is that you recognize that it is a method declaration (albeit with the erroneous extra public modifier) so you go into that production and then hit the error there. (And, if the fault-tolerant machinery is on, it should be able to skip the extraneous modifier and keep going.)

But, you see, if your lookahead is as strict as the parsing, then it rejects MethodDeclaration and tries the next choice and the next choice and the result of that is that you're liable to get some incomprehensible error message. And also (perhaps more importantly) in a fault-tolerant mode, in which you keep parsing after an error, you want it to recognize that it is a MethodDeclaration even if there is the error that a modifier is repeated.

You see my point?
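To make the distinction concrete, here is a rough Java sketch of a loose predicate next to a strict parse. The names (ModifierScan, scanModifiers, parseModifiers) and the short modifier list are assumptions for illustration, not the actual Modifiers production or any Congo API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the lookahead accepts repeats, the parse reports them.
class ModifierScan {
    // Deliberately incomplete list, just enough for the example.
    static final Set<String> MODIFIERS =
            Set.of("public", "private", "static", "final", "abstract");

    // Loose predicate: skips any run of modifiers, repeats included,
    // and returns the index of the first non-modifier token.
    static int scanModifiers(List<String> tokens) {
        int pos = 0;
        while (pos < tokens.size() && MODIFIERS.contains(tokens.get(pos))) {
            pos++;
        }
        return pos;
    }

    // Strict parse: walks the same run, but records an error for each
    // repeated modifier instead of rejecting the whole construct.
    static List<String> parseModifiers(List<String> tokens) {
        List<String> errors = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        int end = scanModifiers(tokens);
        for (int i = 0; i < end; i++) {
            if (!seen.add(tokens.get(i))) {
                errors.add("repeated modifier '" + tokens.get(i) + "' at token " + i);
            }
        }
        return errors;
    }
}
```

For public static public void foo, the scan still reaches void, so the construct is recognized as a method declaration, and the error is reported at the second public rather than somewhere unrelated.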

Of course, the other option is just to have a parser that is very forgiving of these things and then do a subsequent tree-walk that finds these problems. But that just punts on the problem really, I mean insofar as specifying these things inside the grammar itself. At some point, I became moderately obsessed with the whole problem of being able to specify these things in the grammar, which is basically what this is about: https://parsers.org/announcements/reference-java-grammar/

One might think about an alternative operator (I think I was thinking along the lines of the backslash, though you were considering simply doubling the | operator) to mean that the choice that follows can only occur once. And that could actually be the first choice, so (\ A \B \C)* would mean zero or more of A, B, or C, but each one can occur at most once. But that syntax is just off the top of my head really.
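As a sketch of what a per-choice marker implies, the loop below lets some alternatives repeat freely while marked ones are accepted at most once. MarkedChoice and its map-based representation are hypothetical, chosen only to keep the example short.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: each alternative carries its own "once only" flag.
class MarkedChoice {
    // 'alternatives' maps a token to true if it is marked once-only,
    // false if it may repeat. Returns the index of the first token
    // that the loop does not consume.
    static int consume(List<String> tokens, Map<String, Boolean> alternatives) {
        Set<String> usedOnce = new HashSet<>();
        int pos = 0;
        while (pos < tokens.size()) {
            String tok = tokens.get(pos);
            Boolean onceOnly = alternatives.get(tok);
            if (onceOnly == null) {
                break; // not one of the alternatives
            }
            if (onceOnly && !usedOnce.add(tok)) {
                break; // a marked alternative is being repeated
            }
            pos++;
        }
        return pos;
    }
}
```

With every alternative marked, this reduces to the all-or-none set behavior from the original post.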

Well, it's all doable, but there is this matter that, arguably, we should be more disciplined about adding more features when we haven't really sufficiently documented the ones we have. So there is that... 😆

adMartem

@revusky I see what you are saying regarding the looser predicate to keep the parser in the same area as the likely problem. That makes sense.

On the notation, yes, I had not actually been thinking of selected choices being restricted, I assumed it was all or none, but I like your more general view too. My thoughts were focused (possibly too much) by the issue in COBOL, which never has the occasion to draw anything other than the all case in its syntax diagrams.

I have to confess my hesitation to even make the original post due to the angel on my shoulder saying it was feature creep. But the devil won.