I ended up deleting my last post because I misspoke there; I must have been somewhat sleep deprived. I reported that I had managed to implement so-called lazy quantifiers in CongoCC, i.e. what is usually expressed as (...)*?, but that was not really true.
In fact, the real straight dope is that it is IMPOSSIBLE to implement lazy quantifiers in this kind of (DFA-based) regexp engine -- somewhat akin to squaring the circle. That is why longstanding tools like Lex and Flex do not have this feature. Never have and never will...
But there is good news. I did implement a solution to the problem described but it is not as global as the lazy quantifiers would be. Here is what we have now. The classic C-style multi-line comment can now be specified as:
<?COMMENT : "/*" (~[])* "*/" >
Note the extra ? in front of the token name. This means that the token is matched lazily. We do not try to scan for the longest match, which, as I described, is quite problematic in this case. We stop scanning as soon as we hit a final state.
But note that this is specified for the regexp pattern as a whole. There is no way to specify that some loops inside the regexp are greedy and others are non-greedy. If you specify that the entire pattern is matched lazily, then ALL of the loops in the regexp are effectively lazy. (Or reluctant, or whatever term you want to use.)
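To make the difference concrete, here is a small illustration using java.util.regex, which, being a backtracking engine, does support lazy quantifiers. (The class name and input string are just made up for the demo.) The greedy pattern runs all the way to the last */ in the input, which is exactly why longest-match is wrong for comments, while the lazy pattern stops where a COMMENT token should.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedyDemo {
    public static void main(String[] args) {
        String input = "/* first */ int x; /* second */";

        // Greedy: .* grabs as much as it can, so the match runs from the
        // first /* all the way to the LAST */, swallowing "int x;" too.
        Matcher greedy = Pattern.compile("/\\*.*\\*/", Pattern.DOTALL).matcher(input);
        if (greedy.find()) {
            System.out.println("greedy: " + greedy.group()); // /* first */ int x; /* second */
        }

        // Lazy: .*? stops at the first */, which is what a comment token
        // should do; the <?COMMENT ...> token above behaves the same way,
        // for the pattern as a whole.
        Matcher lazy = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL).matcher(input);
        if (lazy.find()) {
            System.out.println("lazy:   " + lazy.group()); // /* first */
        }
    }
}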
Well, being able to specify non-greedy matching only on a token as a whole is obviously less powerful, but it is probably good enough in most cases. The most typical uses of lazy matching are cases where you really just have one loop, like this C-style comment. Or consider the relatively new text block feature in Java, where a multiline string literal is delimited by triple quotes. Here is what I had before:
< TEXT_BLOCK_LITERAL:
'"""' (<HORIZONTAL_WHITESPACE>)* "\n"
(('"'){0,2} ((~['"', '\\']) | <STRING_ESCAPE> | "\\\n"))*
'"""'
>
It can be written a bit more simply using the lazy token feature:
<?TEXT_BLOCK_LITERAL:
'"""' (<HORIZONTAL_WHITESPACE>)* "\n"
((~['\\']) | <STRING_ESCAPE> | "\\\n")*
'"""'
>
Not so much to write home about maybe, but still more readable. In the right spots, a lazy token can definitely make things more readable, which means more maintainable...
The funny thing is that the above, specifying laziness for a token as a whole, was my first-pass implementation. That was so easy that I then got very ambitious and decided that I would implement local lazy quantifiers, not realizing that it was impossible -- I was effectively like a medieval alchemist trying to synthesize gold. Earlier today, I came across a chapter that is freely available online from the O'Reilly book Mastering Regular Expressions. The key text there is:
An NFA engine can support many things that a DFA cannot. Among them are:
- Capturing text matched by a parenthesized subexpression. Related features are backreferences and after-match information saying where in the matched text each parenthesized subexpression matched.
- Lookaround, and other complex zero-width assertions.
- Non-greedy quantifiers and ordered alternation. A DFA could easily support a guaranteed shortest overall match (although for whatever reason, this option never seems to be made available to the user), but it cannot implement the local laziness and ordered alternation that we've talked about.
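That last point is exactly the situation here. As a very rough sketch of the idea (the DfaState type and method names are hypothetical, not CongoCC's actual internals), the only difference between greedy and lazy scanning in a DFA is when you stop:

// Rough sketch only; DfaState and its methods are hypothetical names
// for illustration, not CongoCC's actual code.
interface DfaState {
    DfaState step(char c);     // transition on c, or null if there is none
    boolean isAccepting();     // is this a final (accepting) state?
}

class TokenScanner {
    // Scan one token starting at 'offset'. Returns the end offset of the
    // match, or -1 if nothing matched.
    static int scan(DfaState start, CharSequence input, int offset, boolean lazy) {
        DfaState state = start;
        int lastAccept = -1;
        for (int i = offset; i < input.length(); i++) {
            state = state.step(input.charAt(i));
            if (state == null) break;          // dead end: no longer match is possible
            if (state.isAccepting()) {
                if (lazy) return i + 1;        // lazy: stop at the FIRST final state
                lastAccept = i + 1;            // greedy: remember it and keep scanning
            }
        }
        return lastAccept;                     // greedy: the longest match seen, if any
    }
}

Since the lazy/greedy choice is a property of the scan as a whole, there is simply no place to hang per-quantifier laziness, which is the limitation described above.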
So, the bottom line is that specifying lazy matching for a token as a whole is about the best that can be offered in this style of regex implementation. And besides, looking on the bright side, this sort of regular expression implementation is extremely performant. Also, the fact that there is zero support for backtracking is a double-edged proposition. If you google the string "regular expression catastrophic backtracking" or some such thing, you get all kinds of interesting information about how you can naively write a regular expression that could complete its scan sometime after our sun goes supernova in a few billion years. At least we don't have any catastrophic backtracking problems.
Because we never backtrack...
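For anyone who hasn't seen the phenomenon, here is the textbook demonstration, timed with java.util.regex (a backtracking engine). The pattern (a+)+b is the standard example from the literature, nothing specific to CongoCC; each extra character in the input roughly doubles the running time.

import java.util.regex.Pattern;

public class BacktrackingBlowup {
    public static void main(String[] args) {
        // (a+)+b can never match a string made up only of a's, but a
        // backtracking engine tries every way of splitting the a's between
        // the two quantifiers before giving up: exponential time.
        Pattern evil = Pattern.compile("(a+)+b");
        for (int n = 14; n <= 26; n += 2) {
            String input = "a".repeat(n);        // requires Java 11+
            long start = System.nanoTime();
            boolean matched = evil.matcher(input).matches();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("n=%d matched=%b %d ms%n", n, matched, ms);
        }
        // Push n much higher at your own peril. A DFA-based scanner rejects
        // the same input in a single linear pass, since it never backtracks.
    }
}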