Issue with large grammar

ngx

Hello

Say one wants to congoify this grammar: https://github.com/antlr/grammars-v4/tree/master/sql/tsql
(To the question "why not use ANTLR directly?", the answer is immediate: it is so slow that it is impractical to use in production for large grammars. Is that the price to pay for not having to specify lookaheads by hand?)

First, this grammar has 3 left-recursive productions, which we rewrote by hand in ANTLR itself.
Then we transformed the grammar to the CongoCC syntax, and I see two issues:

1) The generated Lexer.java does not compile because one method compiles to more than 64 KB of bytecode:

    private static void NFA_FUNCTIONS_init() {
        NfaFunction[] functions = new NfaFunction[] {
            DEFAULT_STATE::getNfaNameDEFAULT_STATEIndex0,
            DEFAULT_STATE::getNfaNameDEFAULT_STATEIndex1,
            DEFAULT_STATE::getNfaNameDEFAULT_STATEIndex2,
            DEFAULT_STATE::getNfaNameDEFAULT_STATEIndex3,
            DEFAULT_STATE::getNfaNameDEFAULT_STATEIndex4,
            [... it goes on for 5,000 lines]

This can be dealt with by post-processing the Java code and splitting the method accordingly.

2) My real problem is that this grammar has 1500 choice conflicts such as:

Warning: Choice conflict involving two expansions at line 2547, column 5 and line 2548, column 5 respectively. A common prefix is: "CREATE" "OR" Consider using a lookahead of 3 or more for earlier expansion.
(these warnings are output by the javacc7 version of the grammar)

You say that the SANITY_CHECK option was retained, but I don't see such warnings in the congocc output.
I would love to switch to congo, but the absence of choice conflicts output is a show stopper here...
Do you have any suggestions?

Thanks!

(Incidentally, I also miss the tracing functions. You claim one can use a debugger, but the tracing approach is quicker for me: you quickly see that you entered the wrong production when the output starts crawling back to the left, due to the series of returns.)

revusky

ngx

it is so slow that it is impractical to use in production for large grammars. Is it the price to pay for not having to specify lookaheads by hand?

Oh, I honestly don't know why ANTLR is so slow. My sense is that it may not be that slow generally speaking but there must be some constructs for which it generates painfully slow code. My best guess is that it shouldn't be that much slower generally. Probably they need to buckle down and figure it out. I don't really know. But, as anyone can see, the situation persists year after year and decade after decade. I think I played with it for the last time a couple of years ago and I noted that memory usage can be quite terrifying as well!

Well, anyway, I should tell you first of all that I did a bit more work on the "code too large" problem today and it should really be gone now. As things stood, the XXX_init method would still hit "code too large" if a single lexical state had somewhere north of 6000 NFA states. I didn't think that was very common, since the largest lexical grammars, like C# and Java never even reached 1000 NFA states. Actually, when I wrote my (overly triumphant) post The Dreaded “Code too large” Problem is a Thing of the Past nearly 4 years ago, I wrote the following:

As a final point, the above example assumes that no individual array is so big as to hit the "Code too large" limitation on its own. And that does seem to be the case, in practice, with JavaCC. Fairly obviously, if any single array was big enough, on its own, to hit this limitation, you would need multiple XXX_populate() methods. So, if your foo[] array had, let's say 20,000 elements, you could generate:

And so on... Basically, it puts all the NFA acceptance methods in a big array and if the array is bigger than about 6000 elements, you need to initialize it in more than one XXX_init method. But I never bothered to code the logic for that because it didn't seem that that ever really happens... (WRONG.)

But now it should be okay. See: https://github.com/congo-cc/congo-parser-generator/blob/main/src/templates/java/NfaCode.java.ftl#L30-L71 and you can see how the problem was addressed. It's really very trivial.

If the NFA_FUNCTIONS array is very big, it breaks the initialization into multiple methods now. So if you rebuild with the latest code the problem should be gone. Or you can just grab a pre-built jar.
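
For illustration only, here is a minimal Java sketch of that splitting idea. It is not the code CongoCC generates: the class name, the tiny six-entry table, and the `initFunctions0`/`initFunctions1` helpers are made up, and `IntUnaryOperator` stands in for the real `NfaFunction` type. The point is just that each helper method fills one slice of the array, so no single method body has to hold every entry.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical miniature of a generated lexer table; all names are illustrative.
class ChunkedInitSketch {
    // In a real lexer this array could have thousands of entries.
    static final IntUnaryOperator[] FUNCTIONS = new IntUnaryOperator[6];

    static {
        // One call per slice keeps every method body small.
        initFunctions0();
        initFunctions1();
    }

    private static void initFunctions0() {
        FUNCTIONS[0] = ChunkedInitSketch::state0;
        FUNCTIONS[1] = ChunkedInitSketch::state1;
        FUNCTIONS[2] = ChunkedInitSketch::state2;
    }

    private static void initFunctions1() {
        FUNCTIONS[3] = ChunkedInitSketch::state3;
        FUNCTIONS[4] = ChunkedInitSketch::state4;
        FUNCTIONS[5] = ChunkedInitSketch::state5;
    }

    // Placeholder "state" functions; a generator would emit one per NFA state.
    static int state0(int ch) { return ch; }
    static int state1(int ch) { return ch + 1; }
    static int state2(int ch) { return ch + 2; }
    static int state3(int ch) { return ch + 3; }
    static int state4(int ch) { return ch + 4; }
    static int state5(int ch) { return ch + 5; }

    public static void main(String[] args) {
        System.out.println(FUNCTIONS.length + " functions registered");
    }
}
```

A generator following this pattern would pick the slice size so that each helper stays comfortably under the 64 KB method limit.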

As for the other stuff, well, the "choice conflict" thing, the history of it is that the code that implemented that in legacy JavaCC was just horrible. And I kept thinking I would clean it up. It was written in such an opaque manner that I finally just decided that it was beyond my ability to fix it. So I just tore it out. I was probably thinking at the time that I would address it later.

The thing is, though, that if you are working up a grammar in CongoCC, the whole choice conflict thing is of limited use. I mean if you have:

       <DO> <FOO> ...
       |
       <DO> <BAR> ...

Sure, the second choice is unreachable if you only look ahead the default single token. But the usefulness of having the tool tell you that is really pretty marginal. I mean, if you're writing a grammar, you typically are incrementally testing the parser on typical input and you would see that the above just doesn't work, like it never enters the second choice if you don't specify any lookahead or up-to-here. You need:

    <DO> <FOO> =>|| ...
    |
    <DO> <BAR> ....

So... Of course, if you are using a tool like ANTLR that has a runtime engine that sorts this out for you and you want to convert the entire potentially huge grammar to CongoCC, I guess that could be a problem because suddenly you have all these cases where you're missing the up-to-here.

I'm now thinking about what to do about this...

As for the tracing, to be honest, I was always kind of skeptical about how JavaCC dealt with this. With a very big grammar, it would tend to output so much that.... I mean, in general, it's a very crude disposition, no? Now, of course, with CongoCC you can just put println() (or log) calls in a code action in the productions you're interested in, no? You can also use the life-cycle hooks that exist for tree-building. For example, if you want to inject some code at the point of starting a new node or closing one in the tree-building process, you can define methods with the magic names OPEN_NODE_HOOK and CLOSE_NODE_HOOK respectively. So, you could have:

  INJECT PARSER_CLASS :
  {
         void OPEN_NODE_HOOK(Node n) {
                   // do something by default at the moment of opening a new node.
         }

         void CLOSE_NODE_HOOK(Node n) {
                   // do something by default at the moment of closing a node.
         }
  }
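
As a rough idea of what such hooks could do to recover the "crawling back to the left" effect of a trace, here is a small standalone Java sketch of an indenting tracer. Everything in it is hypothetical (the `IndentTracerSketch` class and its `open`/`close` methods are not part of CongoCC); the assumption is that the hook bodies would call something like this with the node or production name.

```java
// Hypothetical helper; not part of CongoCC. It records an indented trace
// so that nested productions move right and returns move back left.
class IndentTracerSketch {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    // Call when a node (or production) is opened.
    void open(String name) {
        out.append("  ".repeat(depth)).append("> ").append(name).append('\n');
        depth++;
    }

    // Call when the same node (or production) is closed.
    void close(String name) {
        depth--;
        out.append("  ".repeat(depth)).append("< ").append(name).append('\n');
    }

    String trace() {
        return out.toString();
    }

    public static void main(String[] args) {
        IndentTracerSketch tracer = new IndentTracerSketch();
        tracer.open("cfl_statement");
        tracer.open("goto_statement");
        tracer.close("goto_statement");
        tracer.close("cfl_statement");
        System.out.print(tracer.trace());
    }
}
```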

If you really really want, you have access to the call stack, actually, it's called parsingStack internally, though maybe I shouldn't encourage you (I mean not you specifically but any application programmer) to use things like that... But you can. Actually, it's probably not so bad an idea if you only use it in a read-only manner...

But, what I mean is that, what with the ability to define life-cycle hooks and so on...

Well, also, you have assertions. You can pepper your grammar code with ASSERT statements to verify things about the state of the parser or tree or whatever. Actually, for example, you could write something like:

  ASSERT {!isInProduction("Foobar")} : "We are not supposed to be in a Foobar production here!"

The assertion fails if we are currently in a Foobar production. Well, okay, stuff like this is not particularly documented, but I mean the range of debugging tricks available in CongoCC compared to legacy JavaCC is...

Well, anyway, do tell me if the "code too large" issue is fixed. I'm going to think about maybe adding back in some choice conflict (really dead code checking in practice), maybe not exhaustive, but that would cover typical situations.

revusky

ngx but the absence of choice conflicts output is a show stopper here.

On further thought, I am thinking that I will put back some warnings about this. At least very low-hanging fruit sorts of things.

      <FOO><BAR> ...
      |
      <FOO><BAZ> ...

It is certainly easy enough, for the single token lookahead case, to identify that the second choice is dead code, can't be reached, unless you look ahead 2 tokens on the first expansion. Or with:

  <FOO> ...
  |
  <BAR> ...
  |
  (<FOO>|<BAR>) ...

In the above, the third choice is dead code, because if the input started with a <FOO> or a <BAR> we would have already entered the first or second choice, so we can't enter the third one. I guess it's fairly easy to put some checks like that back in. The first set of the first expansion is {FOO} and the first set of the second choice is {BAR}. The union is {FOO,BAR}, same as the first set of the third line, so we can see that the third choice is dead code.
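
To make the reasoning concrete, here is a minimal Java sketch of that kind of check. It is an illustration under stated assumptions, not the CongoCC implementation: the `DeadChoiceSketch` class, the `Tok` enum, and the `deadChoices` method are invented for the example, every choice is assumed to use the default one-token lookahead, and expansions that can match the empty string are ignored.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: flags choices that can never be entered with one-token lookahead.
class DeadChoiceSketch {
    enum Tok { FOO, BAR, BAZ }

    // Returns the indices of choices whose first set is fully covered
    // by the union of the first sets of the earlier choices.
    static List<Integer> deadChoices(List<Set<Tok>> firstSets) {
        Set<Tok> seen = EnumSet.noneOf(Tok.class);
        List<Integer> dead = new ArrayList<>();
        for (int i = 0; i < firstSets.size(); i++) {
            Set<Tok> first = firstSets.get(i);
            if (!first.isEmpty() && seen.containsAll(first)) {
                dead.add(i);
            }
            seen.addAll(first);
        }
        return dead;
    }

    public static void main(String[] args) {
        // First sets of: <FOO> ... | <BAR> ... | (<FOO>|<BAR>) ...
        List<Set<Tok>> choices = List.of(
                EnumSet.of(Tok.FOO),
                EnumSet.of(Tok.BAR),
                EnumSet.of(Tok.FOO, Tok.BAR));
        System.out.println(deadChoices(choices)); // prints [2]
    }
}
```

The same function also reports the second choice of `<FOO><BAR> ... | <FOO><BAZ> ...` as dead, since both first sets are {FOO}.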

So I can put back a check for these things, at least the very simple cases like that. Do you think that would mostly resolve your problem?

ngx

Thanks for looking at my problem!

Actually I am afraid that ambiguities detected by the naked eye inside the same production are unfortunately not so useful as one generally clears them on the fly! Such productions generally arise from copy/pasting the grammar description from the documentation where they don't care too much about ambiguities!

Take this example:

At line 2758 you have the goto_statement() choice inside cfl_statement():
void cfl_statement() : {} { LOOKAHEAD(2) block_statement() | break_statement() | continue_statement() | goto_statement() | if_statement() | print_statement() | raiseerror_statement() | return_statement() | throw_statement() | try_catch_statement() | waitfor_statement() | while_statement() }

Let's look at goto_statement():
void goto_statement() : {} { GOTO() id_() (LOOKAHEAD(2)SEMI())? | id_() COLON() (LOOKAHEAD(2)SEMI())? }

Let's look at id_():
void id_() : {} { ID() | TEMP_ID() | DOUBLE_QUOTE_ID() | DOUBLE_QUOTE_BLANK() | SQUARE_BRACKET_ID() | keyword() | RAW() }

Hmmm. keyword(), badly named, represents ALL keywords that can be used as an id:
void keyword() : {} { ABORT() | ABSOLUTE() | ACCENT_SENSITIVITY() | ACCESS() | ACTION() | ACTIVATION() | ACTIVE() | ADD() | ADDRESS() | AES_128() | AES_192() | AES_256() | AFFINITY() | AFTER() | AGGREGATE() | ALGORITHM() | ALL_CONSTRAINTS() | ALL_ERRORMSGS() | ALL_INDEXES() | ALL_LEVELS() | ALLOW_ENCRYPTED_VALUE_MODIFICATIONS() | ALLOW_PAGE_LOCKS() | ALLOW_ROW_LOCKS() | ALLOW_SNAPSHOT_ISOLATION() | ALLOWED() | ALWAYS() | ANSI_DEFAULTS() | ANSI_NULL_DEFAULT() | ANSI_NULL_DFLT_OFF() | ANSI_NULL_DFLT_ON() | ANSI_NULLS() | ANSI_PADDING() | ANSI_WARNINGS() | APP_NAME() | APPLICATION_LOG() | APPLOCK_MODE() | APPLOCK_TEST() | APPLY() | ARITHABORT() | ARITHIGNORE() | ASCII() | ASSEMBLY() | ASSEMBLYPROPERTY() | AT_KEYWORD() | AUDIT() | AUDIT_GUID() | AUTO() | AUTO_CLEANUP() | AUTO_CLOSE() | AUTO_CREATE_STATISTICS() | AUTO_DROP() | AUTO_SHRINK() | AUTO_UPDATE_STATISTICS() | AUTO_UPDATE_STATISTICS_ASYNC() | AUTOGROW_ALL_FILES() | AUTOGROW_SINGLE_FILE() | AVAILABILITY() | AVG() | BACKUP_CLONEDB() | BACKUP_PRIORITY() | BASE64() | BEGIN_DIALOG() | BIGINT() | BINARY_KEYWORD() | BINARY_CHECKSUM() | BINDING() | BLOB_STORAGE() | BROKER() | BROKER_INSTANCE() | BULK_LOGGED() | CALLER() | CAP_CPU_PERCENT() | CAST() | TRY_CAST() | CATALOG() | CATCH() | CERT_ID() | CERTENCODED() | CERTPRIVATEKEY() | CHANGE() | CHANGE_RETENTION() | CHANGE_TRACKING() | CHAR() | CHARINDEX() | CHECKALLOC() | CHECKCATALOG() | CHECKCONSTRAINTS() | CHECKDB() | CHECKFILEGROUP() | CHECKSUM() | CHECKSUM_AGG() | CHECKTABLE() | CLEANTABLE() | CLEANUP() | CLONEDATABASE() | COL_LENGTH() | COL_NAME() | COLLECTION() | COLUMN_ENCRYPTION_KEY() | COLUMN_MASTER_KEY() | COLUMNPROPERTY() | COLUMNS() | COLUMNSTORE() | COLUMNSTORE_ARCHIVE() | COMMITTED() | COMPATIBILITY_LEVEL() | COMPRESS_ALL_ROW_GROUPS() | COMPRESSION_DELAY() | CONCAT() | CONCAT_WS() | CONCAT_NULL_YIELDS_NULL() | CONTENT() | CONTROL() | COOKIE() | COUNT() | COUNT_BIG() | COUNTER() | CPU() | CREATE_NEW() | CREATION_DISPOSITION() | CREDENTIAL() | CRYPTOGRAPHIC() 
| CUME_DIST() | CURSOR_CLOSE_ON_COMMIT() | CURSOR_DEFAULT() | CURSOR_STATUS() | DATA() | DATA_PURITY() | DATABASE_PRINCIPAL_ID() | DATABASEPROPERTYEX() | DATALENGTH() | DATE_CORRELATION_OPTIMIZATION() | DATEADD() | DATEDIFF() | DATENAME() | DATEPART() | DAYS() | DB_CHAINING() | DB_FAILOVER() | DB_ID() | DB_NAME() | DBCC() | DBREINDEX() | DECRYPTION() | DEFAULT_DOUBLE_QUOTE() | DEFAULT_FULLTEXT_LANGUAGE() | DEFAULT_LANGUAGE() | DEFINITION() | DELAY() | DELAYED_DURABILITY() | DELETED() | DENSE_RANK() | DEPENDENTS() | DES() | DESCRIPTION() | DESX() | DETERMINISTIC() | DHCP() | DIALOG() | DIFFERENCE() | DIRECTORY_NAME() | DISABLE() | DISABLE_BROKER() | DISABLED() | DOCUMENT() | DROP_EXISTING() | DROPCLEANBUFFERS() | DYNAMIC() | ELEMENTS() | EMERGENCY() | EMPTY() | ENABLE() | ENABLE_BROKER() | ENCRYPTED() | ENCRYPTED_VALUE() | ENCRYPTION() | ENCRYPTION_TYPE() | ENDPOINT_URL() | ERROR_BROKER_CONVERSATIONS() | ESTIMATEONLY() | EXCLUSIVE() | EXECUTABLE() | EXIST() | EXIST_SQUARE_BRACKET() | EXPAND() | EXPIRY_DATE() | EXPLICIT() | EXTENDED_LOGICAL_CHECKS() | FAIL_OPERATION() | FAILOVER_MODE() | FAILURE() | FAILURE_CONDITION_LEVEL() | FAST() | FAST_FORWARD() | FILE_ID() | FILE_IDEX() | FILE_NAME() | FILEGROUP() | FILEGROUP_ID() | FILEGROUP_NAME() | FILEGROUPPROPERTY() | FILEGROWTH() | FILENAME() | FILEPATH() | FILEPROPERTY() | FILEPROPERTYEX() | FILESTREAM() | FILTER() | FIRST() | FIRST_VALUE() | FMTONLY() | FOLLOWING() | FORCE() | FORCE_FAILOVER_ALLOW_DATA_LOSS() | FORCED() | FORCEPLAN() | FORCESCAN() | FORMAT() | FORWARD_ONLY() | FREE() | FULLSCAN() | FULLTEXT() | FULLTEXTCATALOGPROPERTY() | FULLTEXTSERVICEPROPERTY() | GB() | GENERATED() | GETDATE() | GETUTCDATE() | GLOBAL() | GO() | GREATEST() | GROUP_MAX_REQUESTS() | GROUPING() | GROUPING_ID() | HADR() | HAS_DBACCESS() | HAS_PERMS_BY_NAME() | HASH() | HEALTH_CHECK_TIMEOUT() | HIDDEN_KEYWORD() | HIGH() | HONOR_BROKER_PRIORITY() | HOURS() | IDENT_CURRENT() | IDENT_INCR() | IDENT_SEED() | IDENTITY_VALUE() | 
IGNORE_CONSTRAINTS() | IGNORE_DUP_KEY() | IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX() | IGNORE_REPLICATED_TABLE_CACHE() | IGNORE_TRIGGERS() | IMMEDIATE() | IMPERSONATE() | IMPLICIT_TRANSACTIONS() | IMPORTANCE() | INCLUDE_NULL_VALUES() | INCREMENTAL() | INDEX_COL() | INDEXKEY_PROPERTY() | INDEXPROPERTY() | INITIATOR() | INPUT() | INSENSITIVE() | INSERTED() | INT() | IP() | IS_MEMBER() | IS_ROLEMEMBER() | IS_SRVROLEMEMBER() | ISJSON() | ISOLATION() | JOB() | JSON() | JSON_OBJECT() | JSON_ARRAY() | JSON_VALUE() | JSON_QUERY() | JSON_MODIFY() | JSON_PATH_EXISTS() | KB() | KEEP() | KEEPDEFAULTS() | KEEPFIXED() | KEEPIDENTITY() | KEY_SOURCE() | KEYS() | KEYSET() | LAG() | LAST() | LAST_VALUE() | LEAD() | LEAST() | LEN() | LEVEL() | LIST() | LISTENER() | LISTENER_URL() | LOB_COMPACTION() | LOCAL() | LOCATION() | LOCK() | LOCK_ESCALATION() | LOGIN() | LOGINPROPERTY() | LOOP() | LOW() | LOWER() | LTRIM() | MANUAL() | MARK() | MASKED() | MATERIALIZED() | MAX() | MAX_CPU_PERCENT() | MAX_DOP() | MAX_FILES() | MAX_IOPS_PER_VOLUME() | MAX_MEMORY_PERCENT() | MAX_PROCESSES() | MAX_QUEUE_READERS() | MAX_ROLLOVER_FILES() | MAXDOP() | MAXRECURSION() | MAXSIZE() | MB() | MEDIUM() | MEMORY_OPTIMIZED_DATA() | MESSAGE() | MIN() | MIN_ACTIVE_ROWVERSION() | MIN_CPU_PERCENT() | MIN_IOPS_PER_VOLUME() | MIN_MEMORY_PERCENT() | MINUTES() | MIRROR_ADDRESS() | MIXED_PAGE_ALLOCATION() | MODE() | MODIFY() | MODIFY_SQUARE_BRACKET() | MOVE() | MULTI_USER() | NAME() | NCHAR() | NESTED_TRIGGERS() | NEW_ACCOUNT() | NEW_BROKER() | NEW_PASSWORD() | NEWNAME() | NEXT() | NO() | NO_INFOMSGS() | NO_QUERYSTORE() | NO_STATISTICS() | NO_TRUNCATE() | NO_WAIT() | NOCOUNT() | NODES() | NOEXEC() | NOEXPAND() | NOINDEX() | NOLOCK() | NON_TRANSACTED_ACCESS() | NORECOMPUTE() | NORECOVERY() | NOTIFICATIONS() | NOWAIT() | NTILE() | NULL_DOUBLE_QUOTE() | NUMANODE() | NUMBER() | NUMERIC_ROUNDABORT() | OBJECT() | OBJECT_DEFINITION() | OBJECT_ID() | OBJECT_NAME() | OBJECT_SCHEMA_NAME() | OBJECTPROPERTY() | OBJECTPROPERTYEX() | 
OFFLINE() | OFFSET() | OLD_ACCOUNT() | ONLINE() | ONLY() | OPEN_EXISTING() | OPENJSON() | OPTIMISTIC() | OPTIMIZE() | OPTIMIZE_FOR_SEQUENTIAL_KEY() | ORIGINAL_DB_NAME() | ORIGINAL_LOGIN() | OUT() | OUTPUT() | OVERRIDE() | OWNER() | OWNERSHIP() | PAD_INDEX() | PAGE_VERIFY() | PAGECOUNT() | PAGLOCK() | PARAMETERIZATION() | PARSENAME() | PARSEONLY() | PARTITION() | PARTITIONS() | PARTNER() | PATH() | PATINDEX() | PAUSE() | PDW_SHOWSPACEUSED() | PERCENT_RANK() | PERCENTILE_CONT() | PERCENTILE_DISC() | PERMISSIONS() | PERSIST_SAMPLE_PERCENT() | PHYSICAL_ONLY() | POISON_MESSAGE_HANDLING() | POOL() | PORT() | PRECEDING() | PRIMARY_ROLE() | PRIOR() | PRIORITY() | PRIORITY_LEVEL() | PRIVATE() | PRIVATE_KEY() | PRIVILEGES() | PROCCACHE() | PROCEDURE_NAME() | PROPERTY() | PROVIDER() | PROVIDER_KEY_NAME() | PWDCOMPARE() | PWDENCRYPT() | QUERY() | QUERY_SQUARE_BRACKET() | QUEUE() | QUEUE_DELAY() | QUOTED_IDENTIFIER() | QUOTENAME() | RANDOMIZED() | RANGE() | RANK() | RC2() | RC4() | RC4_128() | READ_COMMITTED_SNAPSHOT() | READ_ONLY() | READ_ONLY_ROUTING_LIST() | READ_WRITE() | READCOMMITTED() | READCOMMITTEDLOCK() | READONLY() | READPAST() | READUNCOMMITTED() | READWRITE() | REBUILD() | RECEIVE() | RECOMPILE() | RECOVERY() | RECURSIVE_TRIGGERS() | RELATIVE() | REMOTE() | REMOTE_PROC_TRANSACTIONS() | REMOTE_SERVICE_NAME() | REMOVE() | REORGANIZE() | REPAIR_ALLOW_DATA_LOSS() | REPAIR_FAST() | REPAIR_REBUILD() | REPEATABLE() | REPEATABLEREAD() | REPLACE() | REPLICA() | REPLICATE() | REQUEST_MAX_CPU_TIME_SEC() | REQUEST_MAX_MEMORY_GRANT_PERCENT() | REQUEST_MEMORY_GRANT_TIMEOUT_SEC() | REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT() | RESAMPLE() | RESERVE_DISK_SPACE() | RESOURCE() | RESOURCE_MANAGER_LOCATION() | RESTRICTED_USER() | RESUMABLE() | RETENTION() | REVERSE() | ROBUST() | ROOT() | ROUTE() | ROW() | ROW_NUMBER() | ROWGUID() | ROWLOCK() | ROWS() | RTRIM() | SAMPLE() | SCHEMA_ID() | SCHEMA_NAME() | SCHEMABINDING() | SCOPE_IDENTITY() | SCOPED() | SCROLL() | SCROLL_LOCKS() | 
SEARCH() | SECONDARY() | SECONDARY_ONLY() | SECONDARY_ROLE() | SECONDS() | SECRET() | SECURABLES() | SECURITY() | SECURITY_LOG() | SEEDING_MODE() | SELF() | SEMI_SENSITIVE() | SEND() | SENT() | SEQUENCE() | SEQUENCE_NUMBER() | SERIALIZABLE() | SERVERPROPERTY() | SERVICEBROKER() | SESSIONPROPERTY() | SESSION_TIMEOUT() | SETERROR() | SHARE() | SHARED() | SHOWCONTIG() | SHOWPLAN() | SHOWPLAN_ALL() | SHOWPLAN_TEXT() | SHOWPLAN_XML() | SIGNATURE() | SIMPLE() | SINGLE_USER() | SIZE() | SMALLINT() | SNAPSHOT() | SORT_IN_TEMPDB() | SOUNDEX() | SPACE_KEYWORD() | SPARSE() | SPATIAL_WINDOW_MAX_CELLS() | SQL_VARIANT_PROPERTY() | STANDBY() | START_DATE() | STATIC() | STATISTICS_INCREMENTAL() | STATISTICS_NORECOMPUTE() | STATS_DATE() | STATS_STREAM() | STATUS() | STATUSONLY() | STDEV() | STDEVP() | STOPLIST() | STR() | STRING_AGG() | STRING_ESCAPE() | STUFF() | SUBJECT() | SUBSCRIBE() | SUBSCRIPTION() | SUBSTRING() | SUM() | SUSER_ID() | SUSER_NAME() | SUSER_SID() | SUSER_SNAME() | SUSPEND() | SYMMETRIC() | SYNCHRONOUS_COMMIT() | SYNONYM() | SYSTEM() | TABLERESULTS() | TABLOCK() | TABLOCKX() | TAKE() | TARGET_RECOVERY_TIME() | TB() | TEXTIMAGE_ON() | THROW() | TIES() | TIME() | TIMEOUT() | TIMER() | TINYINT() | TORN_PAGE_DETECTION() | TRACKING() | TRANSACTION_ID() | TRANSFORM_NOISE_WORDS() | TRANSLATE() | TRIM() | TRIPLE_DES() | TRIPLE_DES_3KEY() | TRUSTWORTHY() | TRY() | TSQL() | TWO_DIGIT_YEAR_CUTOFF() | TYPE() | TYPE_ID() | TYPE_NAME() | TYPE_WARNING() | TYPEPROPERTY() | UNBOUNDED() | UNCOMMITTED() | UNICODE() | UNKNOWN() | UNLIMITED_TOKEN() | UNMASK() | UOW() | UPDLOCK() | UPPER() | USER_ID() | USER_NAME() | USING() | VALID_XML() | VALIDATION() | VALUE() | VALUE_SQUARE_BRACKET() | VAR() | VARBINARY_KEYWORD() | VARP() | VERIFY_CLONEDB() | VERSION() | VIEW_METADATA() | VIEWS() | WAIT() | WELL_FORMED_XML() | WITHOUT_ARRAY_WRAPPER() | WORK() | WORKLOAD() | XLOCK() | XML() | XML_COMPRESSION() | XMLDATA() | XMLNAMESPACES() | XMLSCHEMA() | XSINIL() | ZONE() | ABORT_AFTER_WAIT() | 
ABSENT() | ADMINISTER() | AES() | ALLOW_CONNECTIONS() | ALLOW_MULTIPLE_EVENT_LOSS() | ALLOW_SINGLE_EVENT_LOSS() | ANONYMOUS() | APPEND() | APPLICATION() | ASYMMETRIC() | ASYNCHRONOUS_COMMIT() | AUTHENTICATE() | AUTHENTICATION() | AUTOMATED_BACKUP_PREFERENCE() | AUTOMATIC() | AVAILABILITY_MODE() | BEFORE() | BLOCK() | BLOCKERS() | BLOCKSIZE() | BLOCKING_HIERARCHY() | BUFFER() | BUFFERCOUNT() | CACHE() | CALLED() | CERTIFICATE() | CHANGETABLE() | CHANGES() | CHECK_POLICY() | CHECK_EXPIRATION() | CLASSIFIER_FUNCTION() | CLUSTER() | COMPRESS() | COMPRESSION() | CONNECT() | CONNECTION() | CONFIGURATION() | CONNECTIONPROPERTY() | CONTAINMENT() | CONTEXT() | CONTEXT_INFO() | CONTINUE_AFTER_ERROR() | CONTRACT() | CONTRACT_NAME() | CONVERSATION() | COPY_ONLY() | CURRENT_REQUEST_ID() | CURRENT_TRANSACTION_ID() | CYCLE() | DATA_COMPRESSION() | DATA_SOURCE() | DATABASE_MIRRORING() | DATASPACE() | DDL() | DECOMPRESS() | DEFAULT_DATABASE() | DEFAULT_SCHEMA() | DIAGNOSTICS() | DIFFERENTIAL() | DISTRIBUTION() | DTC_SUPPORT() | ENABLED() | ENDPOINT() | ERROR() | ERROR_LINE() | ERROR_MESSAGE() | ERROR_NUMBER() | ERROR_PROCEDURE() | ERROR_SEVERITY() | ERROR_STATE() | EVENT() | EVENTDATA() | EVENT_RETENTION_MODE() | EXECUTABLE_FILE() | EXPIREDATE() | EXTENSION() | EXTERNAL_ACCESS() | FAILOVER() | FAILURECONDITIONLEVEL() | FAN_IN() | FILE_SNAPSHOT() | FORCESEEK() | FORCE_SERVICE_ALLOW_DATA_LOSS() | FORMATMESSAGE() | GET() | GET_FILESTREAM_TRANSACTION_CONTEXT() | GETANCESTOR() | GETANSINULL() | GETDESCENDANT() | GETLEVEL() | GETREPARENTEDVALUE() | GETROOT() | GOVERNOR() | HASHED() | HEALTHCHECKTIMEOUT() | HEAP() | HIERARCHYID() | HOST_ID() | HOST_NAME() | IIF() | IO() | INCLUDE() | INCREMENT() | INFINITE() | INIT() | INSTEAD() | ISDESCENDANTOF() | ISNULL() | ISNUMERIC() | KERBEROS() | KEY_PATH() | KEY_STORE_PROVIDER_NAME() | LANGUAGE() | LIBRARY() | LIFETIME() | LINKED() | LINUX() | LISTENER_IP() | LISTENER_PORT() | LOCAL_SERVICE_NAME() | LOG() | MASK() | MATCHED() | MASTER() | 
MAX_MEMORY() | MAXTRANSFER() | MAXVALUE() | MAX_DISPATCH_LATENCY() | MAX_DURATION() | MAX_EVENT_SIZE() | MAX_SIZE() | MAX_OUTSTANDING_IO_PER_VOLUME() | MEDIADESCRIPTION() | MEDIANAME() | MEMBER() | MEMORY_PARTITION_MODE() | MESSAGE_FORWARDING() | MESSAGE_FORWARD_SIZE() | MINVALUE() | MIRROR() | MUST_CHANGE() | NEWID() | NEWSEQUENTIALID() | NOFORMAT() | NOINIT() | NONE() | NOREWIND() | NOSKIP() | NOUNLOAD() | NO_CHECKSUM() | NO_COMPRESSION() | NO_EVENT_LOSS() | NOTIFICATION() | NTLM() | OLD_PASSWORD() | ON_FAILURE() | OPERATIONS() | PAGE() | PARAM_NODE() | PARTIAL() | PASSWORD() | PERMISSION_SET() | PER_CPU() | PER_DB() | PER_NODE() | PERSISTED() | PLATFORM() | POLICY() | PREDICATE() | PROCESS() | PROFILE() | PYTHON() | R() | READ_WRITE_FILEGROUPS() | REGENERATE() | RELATED_CONVERSATION() | RELATED_CONVERSATION_GROUP() | REQUIRED() | RESET() | RESOURCES() | RESTART() | RESUME() | RETAINDAYS() | RETURNS() | REWIND() | ROLE() | ROUND_ROBIN() | ROWCOUNT_BIG() | RSA_512() | RSA_1024() | RSA_2048() | RSA_3072() | RSA_4096() | SAFETY() | SAFE() | SCHEDULER() | SCHEME() | SCRIPT() | SERVER() | SERVICE() | SERVICE_BROKER() | SERVICE_NAME() | SESSION() | SESSION_CONTEXT() | SETTINGS() | SHRINKLOG() | SID() | SKIP_KEYWORD() | SOFTNUMA() | SOURCE() | SPECIFICATION() | SPLIT() | SQL() | SQLDUMPERFLAGS() | SQLDUMPERPATH() | SQLDUMPERTIMEOUT() | STATE() | STATS() | START() | STARTED() | STARTUP_STATE() | STOP() | STOPPED() | STOP_ON_ERROR() | SUPPORTED() | SWITCH() | TAPE() | TARGET() | TCP() | TOSTRING() | TRACE() | TRACK_CAUSALITY() | TRANSFER() | UNCHECKED() | UNLOCK() | UNSAFE() | URL() | USED() | VERBOSELOGGING() | VISIBILITY() | WAIT_AT_LOW_PRIORITY() | WINDOWS() | WITHOUT() | WITNESS() | XACT_ABORT() | XACT_STATE() | ABS() | ACOS() | ASIN() | ATAN() | ATN2() | CEILING() | COS() | COT() | DEGREES() | EXP() | FLOOR() | LOG10() | PI() | POWER() | RADIANS() | RAND() | ROUND() | SIGN() | SIN() | SQRT() | SQUARE() | TAN() | CURRENT_TIMEZONE() | CURRENT_TIMEZONE_ID() | 
DATE_BUCKET() | DATEDIFF_BIG() | DATEFROMPARTS() | DATETIME2FROMPARTS() | DATETIMEFROMPARTS() | DATETIMEOFFSETFROMPARTS() | DATETRUNC() | DAY() | EOMONTH() | ISDATE() | MONTH() | SMALLDATETIMEFROMPARTS() | SWITCHOFFSET() | SYSDATETIME() | SYSDATETIMEOFFSET() | SYSUTCDATETIME() | TIMEFROMPARTS() | TODATETIMEOFFSET() | YEAR() | QUARTER() | DAYOFYEAR() | WEEK() | HOUR() | MINUTE() | SECOND() | MILLISECOND() | MICROSECOND() | NANOSECOND() | TZOFFSET() | ISO_WEEK() | WEEKDAY() | YEAR_ABBR() | QUARTER_ABBR() | MONTH_ABBR() | DAYOFYEAR_ABBR() | DAY_ABBR() | WEEK_ABBR() | HOUR_ABBR() | MINUTE_ABBR() | SECOND_ABBR() | MILLISECOND_ABBR() | MICROSECOND_ABBR() | NANOSECOND_ABBR() | TZOFFSET_ABBR() | ISO_WEEK_ABBR() | WEEKDAY_ABBR() | SP_EXECUTESQL() | VARCHAR() | NVARCHAR() | PRECISION() | FILESTREAM_ON() }

I am being a bit nasty, as I purposely included the full extent of the keyword() production, just to give you an impression of how horrible a grammar can be!

Fortunately, JavaCC tells me:
Warning: Choice conflict involving two expansions at line 2758, column 6 and line 2763, column 6 respectively. A common prefix is: "THROW" Consider using a lookahead of 2 for earlier expansion.

I guess you can gauge how useful these warning messages are when conflicts happen across productions!
These warnings allow me to add lookaheads where appropriate, and they give me confidence that the grammar I put out, if incorrect, is at least not incorrect because of forgotten lookaheads at choice-conflict locations!

To give you an idea of the magnitude of the problem, I have today 714 lines with one or more LOOKAHEAD and more to come:
excerpt of grep LOOKAHEAD parser.jjt:
| DOUBLE_COLON() function_call() (LOOKAHEAD(2)as_table_alias())? OPENXML() LR_BRACKET() expression() COMMA() expression() ( COMMA() expression() )? RR_BRACKET() (LOOKAHEAD(2) WITH() LR_BRACKET() schema_declaration() RR_BRACKET() )? (LOOKAHEAD(2)as_table_alias())? OPENJSON() LR_BRACKET() expression() ( COMMA() expression() )? RR_BRACKET() (LOOKAHEAD(2) WITH() LR_BRACKET() json_declaration() RR_BRACKET() )? (LOOKAHEAD(2)as_table_alias())? LOOKAHEAD(3) change_table_changes() |LOOKAHEAD(2) cross_join() full_column_name() (LOOKAHEAD(2) COMMA() full_column_name() ) * | (BULK() STRING() COMMA() (LOOKAHEAD(2) bulk_option() (LOOKAHEAD(2) COMMA() bulk_option() ) * | id_() ) RR_BRACKET() ) ) ) LOOKAHEAD(3) subquery() | LOOKAHEAD(3) scalar_function_name() LR_BRACKET() (expression_list_())? RR_BRACKET() | TRIM() LR_BRACKET() (LOOKAHEAD(2) expression() FROM() )? expression() RR_BRACKET() |LOOKAHEAD(3) MIN_ACTIVE_ROWVERSION() LR_BRACKET() RR_BRACKET() |LOOKAHEAD(3) IDENTITY() LR_BRACKET() data_type() ( COMMA() DECIMAL() COMMA() DECIMAL() )? RR_BRACKET() | IDENTITY() LR_BRACKET() data_type() (LOOKAHEAD(2) COMMA() DECIMAL() )? ( COMMA() DECIMAL() )? RR_BRACKET() | JSON_OBJECT() LR_BRACKET() (LOOKAHEAD(2) json_key_value() ( COMMA() json_key_value() )* )? (json_null_clause())? RR_BRACKET() | JSON_ARRAY() LR_BRACKET() (LOOKAHEAD(2)expression_list_())? (json_null_clause())? RR_BRACKET() LOOKAHEAD(2) value_method() |LOOKAHEAD(2) query_method() |LOOKAHEAD(2) exist_method() ( LOOKAHEAD(2) LOCAL_ID() |LOOKAHEAD(2) full_column_name() | EVENTDATA() LR_BRACKET() RR_BRACKET() |LOOKAHEAD(2) query_method() | LR_BRACKET() subquery() RR_BRACKET() ) DOT() value_call() (LOOKAHEAD(2) sybase_legacy_hint()) + ( AVG() | MAX() | MIN() | SUM() | STDEV() | STDEVP() | VAR() | VARP() ) LR_BRACKET() all_distinct_expression() RR_BRACKET() (LOOKAHEAD(2)over_clause())? | ( COUNT() | COUNT_BIG() ) LR_BRACKET() ( STAR() | all_distinct_expression() ) RR_BRACKET() (LOOKAHEAD(2)over_clause())? 
LOOKAHEAD(2) window_frame_preceding() FILESTREAM() ( database_filestream_option() (LOOKAHEAD(2) COMMA() database_filestream_option() )* ) FILEGROUP() id_() (LOOKAHEAD(2) CONTAINS() FILESTREAM() )? ( DEFAULT_TOKEN() )? ( CONTAINS() MEMORY_OPTIMIZED_DATA() )? file_spec() (LOOKAHEAD(2) [LOOKAHEAD(2) prefix_list()] id_() ( LOOKAHEAD(2) prefix() )+ id_() (LOOKAHEAD(2) DOT() id_())* | BLOCKING_HIERARCHY() (LOOKAHEAD(2) id_() DOT() )? id_() [LOOKAHEAD(2) prefix_list()] id_() (LOOKAHEAD(2) (id_())? DOT() )* id_() BEGIN() CONVERSATION() TIMER() LR_BRACKET() LOCAL_ID() RR_BRACKET() TIMEOUT() EQUAL() time() (LOOKAHEAD(2)SEMI())? service_name() ( COMMA() STRING() )? ON() CONTRACT() contract_name() (LOOKAHEAD(2) )? (LOOKAHEAD(2)SEMI())? (LOOKAHEAD(2) id_() | expression() ) (LOOKAHEAD(2) id_() | expression() ) END() CONVERSATION() LOCAL_ID() (LOOKAHEAD(2)SEMI())? (LOOKAHEAD(2) (WAITFOR())? LR_BRACKET() get_conversation() RR_BRACKET() ( (COMMA())? TIMEOUT() time() )? (LOOKAHEAD(2)SEMI())? GET() CONVERSATION() GROUP() ( STRING() | LOCAL_ID() ) FROM() queue_id() (LOOKAHEAD(2)SEMI())? LOOKAHEAD(2)( id_() DOT() id_() DOT() id_() ) SEND() ON() CONVERSATION() ( STRING() | LOCAL_ID() ) MESSAGE() TYPE() expression() (LOOKAHEAD(2) )? (LOOKAHEAD(2)SEMI())? LOOKAHEAD(3) (LOOKAHEAD(3) VARCHAR() | NVARCHAR() | BINARY_KEYWORD() | VARBINARY_KEYWORD() | SQUARE_BRACKET_ID() ) LR_BRACKET() MAX() RR_BRACKET()

(Oh, and BTW, the second reason for discounting ANTLR is that it voluntarily does NOT produce ASTs... Having a listener on the parse tree is of little use when one wants to analyze code.)

From a quick look at the JavaCC7 code, it seems that the two places where JavaCC7 issues such warnings are here:
https://github.com/javacc/javacc/blob/5830da892352485813179c6938b477bbdb858be7/src/main/java/org/javacc/parser/LookaheadCalc.java#L164
and
https://github.com/javacc/javacc/blob/5830da892352485813179c6938b477bbdb858be7/src/main/java/org/javacc/parser/LookaheadCalc.java#L254

Would it be difficult to retrofit these into your code?
Because (1) for me it would be super useful, and (2) for the many people who are new to grammars or still learning, these warnings might help them figure out how the parser works in general!

Thanks!

revusky

ngx Actually I am afraid that ambiguities detected by the naked eye inside the same production are unfortunately not so useful as one generally clears them on the fly!

Well, maybe I wasn't clear about one thing. The first set logic will look into nested non-terminals. So, in the example I gave, where we had:

         <FOO> ...
         |
         <BAR> ....
         |
         (<FOO>|<BAR>) ...

Any code that tells you that the third choice won't be entered would also work if these were separate productions, like:

          StartsWithFoo()
          |
          StartsWithBar()
          |
          StartsWithFooOrBar()

The (already implemented) first set logic, as in Expansion::firstSet() can certainly figure out the first set of the various choices and identify the fact that the third choice is dead code, since either <FOO> or <BAR> would already be matched by the first or second choice. Well, the bottom line is that it's probably not too hard to put back in the warnings you need.
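
A toy Java sketch of how a first-set computation can see through nested non-terminals is given here. This is not the code behind `Expansion::firstSet()`; the `NestedFirstSetSketch` class, the map-based grammar encoding, and the extra `FooOrBar` production and `<BAZ>` token are all assumptions made for the example. It also assumes that no expansion can match the empty string and that there is no left recursion.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a first-set computation over a toy grammar.
class NestedFirstSetSketch {
    // Production name -> alternatives; each alternative is a sequence of symbols.
    // Symbols written like "<FOO>" are tokens; anything else names another production.
    static final Map<String, List<List<String>>> GRAMMAR = Map.of(
            "StartsWithFoo", List.of(List.of("<FOO>", "<BAZ>")),
            "StartsWithBar", List.of(List.of("<BAR>", "<BAZ>")),
            "StartsWithFooOrBar", List.of(List.of("FooOrBar", "<BAZ>")),
            "FooOrBar", List.of(List.of("<FOO>"), List.of("<BAR>")));

    // Tokens that can start the given symbol, following non-terminals recursively.
    static Set<String> firstSet(String symbol) {
        Set<String> result = new LinkedHashSet<>();
        if (symbol.startsWith("<")) {
            result.add(symbol);
            return result;
        }
        for (List<String> alternative : GRAMMAR.get(symbol)) {
            result.addAll(firstSet(alternative.get(0)));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> covered = new LinkedHashSet<>();
        covered.addAll(firstSet("StartsWithFoo"));
        covered.addAll(firstSet("StartsWithBar"));
        // True means the third choice can never be entered with one-token lookahead.
        System.out.println(covered.containsAll(firstSet("StartsWithFooOrBar"))); // prints true
    }
}
```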

By the way, these aren't really ambiguities. (At least not in MY mental universe!) The notion that this is an ambiguity is something from the theory of context-free grammars, where, somehow the choices all have equal priority. Of course, in any real-world implementation, the expansions are checked in order and it just takes the first match. So, if it's the default case of just checking 1 token ahead, then the third choice is just dead code. But there is really nothing "ambiguous" about it.

That said, it may not matter that much as a practical question whether you say you are warning about "dead code" (i.e. the expansion will never be entered) or a so-called "choice ambiguity"....

Now, in terms of your example, that keyword() production with nearly a thousand (over 900) different keywords... well, these are actually contextual keywords, to be precise. In a certain context they are keywords, but elsewhere they are just identifiers, right? For example, Java has a number of these contextual keywords, such as yield or record and others. C# has a lot more contextual keywords. CongoCC has provisions for dealing with this problem, but I am not really sure that it is best to deal with these things as contextual keywords when you have this many of them.

But, regardless of that, you can see how the C# grammar handles it. The contextual keywords are defined, but are deactivated by default, as you can see here: https://github.com/congo-cc/congo-parser-generator/blob/main/examples/csharp/CSharp.ccc#L14

And then it typically activates the relevant token type(s) in the key spot. For example, consider this point in the RecordDeclaration production: https://github.com/congo-cc/congo-parser-generator/blob/main/examples/csharp/CSharp.ccc#L354

We activate the RECORD soft keyword so that it is matched at that spot. But elsewhere "record" is just matched as an identifier.

Now, in terms of the specific case you mentioned, the throw_statement, the problem seems to be that there is an "ambiguity" between:

          throw ;

and:

         throw :

The first should be parsed as a throw_statement and the second should be a goto_statement, right? In the first case, throw is a contextual keyword, and in the second case, throw is just an identifier.

Just eyeballing this, it seems like maybe the cleanest solution is just to write the throw statement as:

       throw_statement : 
             SCAN 0 {getToken(1).toString().equalsIgnoreCase("throw") && 
                          !getToken(2).toString().equals(":")}#
             =>
             ACTIVATE_TOKENS THROW (<THROW>) 
             etc...
        ;

Of course, it occurs to me that this throw_statement would have to occur before the goto_statement. But you see, because of the SCAN (semantic lookahead, it was called in legacy JavaCC) you don't enter the production if you have "throw" followed by a colon. So then the goto_statement can match that case further down.

Of course, it might be simpler not to match these various contextual (or soft) keywords as separate token types. In that case, the above would just be:

    throw_statement : 
             SCAN 0 {getToken(1).toString().equalsIgnoreCase("throw") && 
                          !getToken(2).toString().equals(":")}#
             => <IDENTIFIER> etc...
    ;

In that case, the "throw" token would just be in the AST as an identifier. But it might not matter really.

Or, actually, maybe that last one could be written more elegantly as:

           throw_statement :
                <IDENTIFIER> 
                ENSURE {getToken(0).toString().equalsIgnoreCase("throw")}
                ENSURE ~(<COLON>)
                =>||
               etc...
           ;

Well, anyway, don't be shy about making comments or asking questions...

revusky

ngx Would it be difficult to retrofit these guys inside your code?
Because (1) for me it would be super useful, and (2) for the many people who are new to grammars or still learning, these warnings might help them figure out how the parser works in general!

I put the warnings back in. At the moment, this is only in my own fork and also the prebuilt jarfile here

But, anyway, please do try it out. Basically, it only works on simple LL(1), i.e. when we have the default one-token lookahead in effect. Once the construct specifies any sort of lookahead (or up-to-here) then there is no dead code check (a.k.a. ambiguity check) because it just assumes that you know what you're doing. But actually, that is how legacy JavaCC worked (and works) as well.

But I think this should do what you want. Try it and let me know.

revusky

revusky I put the warnings back in. At the moment, this is only in my own fork and also the prebuilt jarfile here

The warnings about dead code are now merged into the main repository.