Hi

I wrote a T-SQL (Sybase, actually) parser using the early-age JavaCC, and my ambition now is to make it more generic.
In your opinion, is it worthwhile to envisage using a single grammar to cover more than one SQL dialect?
The two main obstacles I can see are the keywords, as well as the variant grammars, which are already a bit ambiguous by themselves...

Thanks

    ngx

    Hi, sorry to be a bit slow replying. I only saw that you had put up the question today. Let's see... as for supporting multiple SQL dialects in a maintainable way, I suppose you realize that this kind of problem is pretty much intractable with the legacy JavaCC. It's really too under-powered since it is missing certain very basic features you would need to organize the code.

    JavaCC 21 (or we should maybe get used to calling it CongoCC) does have certain key features that would allow you to attack the problem. It's still not so easy, of course, but you have a lot more degrees of freedom to organize your code in a maintainable way.

    Specifically, for example, there is a preprocessor, see: https://javacc.com/2021/02/01/javacc-21-has-a-preprocessor/

    So you can turn on/off parts of your grammar conditionally based on preprocessor symbols.
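
    For instance, a minimal sketch (the MYSQL symbol and the production are invented here purely for illustration; the directives are the C#-style ones the preprocessor article describes):

        // Guard a dialect-specific production behind a preprocessor symbol;
        // MYSQL would be defined when generating the MySQL variant of the parser.
        #if MYSQL
        SelectModifier : <HIGH_PRIORITY> | <STRAIGHT_JOIN> ;
        #endif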

    Also, the ability to have context-sensitive tokenization could be quite useful, as described here: https://javacc.com/2021/06/13/activating-de-activating-tokens/
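
    For instance, a rough sketch (the HIGH_PRIORITY token name is invented; the ACTIVATE_TOKENS syntax is the one from that article):

        SelectStmt :
           <SELECT>
           // HIGH_PRIORITY is only a live token type inside this expansion;
           // elsewhere the same text tokenizes as a plain identifier.
           ACTIVATE_TOKENS HIGH_PRIORITY ((<HIGH_PRIORITY>)?)
           // ... the rest of the select statement ...
        ;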

    Also, having an INCLUDE directive would make a big difference. So you could have a master generic SQL grammar and, in conjunction with the preprocessor, conditionally include different subgrammar files.
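
    So the top of the master grammar might look something like this (the file and symbol names are hypothetical, of course):

        // The shared SQL skeleton, always included.
        INCLUDE "CommonSQL.javacc"
        // Dialect-specific subgrammars, selected via preprocessor symbols.
        #if POSTGRES
        INCLUDE "PostgresExtras.javacc"
        #elif MYSQL
        INCLUDE "MySQLExtras.javacc"
        #endif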

    So, I guess the answer to your question is that, yes, what you are thinking about can be envisaged. The machinery is there for you to attack the problem.

    So, basically: go for it!

    By the way, if you do, you could consider donating the result to this project. I think it would be quite attractive to have a real-world-useful SQL grammar in our examples directory!

    And, of course, your authorship would be attributed, and you would be able to use it in your own projects just the same, so... something to bear in mind.




      revusky

      Thanks! I am in the thinking phase for now, but I will give it a go quite soon.
      The preprocessor is a good thing: I remember introducing m4 for the same reasons a long time ago (people shouldn't know about this... I was born in '61). However, I believe that my problem would rather be solved at run-time...

      For the grammar, it's OK to have a super-grammar, as many variant constructs are optional anyway.
      For the tokens, if I know the dialect at parse time, I can certainly use the token activator, indeed.

      I will let you know how it goes, and share the tricks, obviously.

      Thanks a lot for enhancing JavaCC the way you did. I never switched to other tools, for several reasons (mainly speed, the treatment of case insensitivity, philosophy...). I struggled for nights over lookaheads; as you may know, SQL is rather ambiguous. I am eager to see how the new constructs can help in strange productions.

      Hi

      BTW, there is an SQL:2016 EBNF spec here.
      If anyone has an EBNF-to-CongoCC converter in a drawer...

      Regarding the ubiquitous SQL grammar:

      • for the reserved-word problem, I found that adding the optional keywords to the Id definition does the job.
        In this list, you can see that analy(s|z)e is a reserved word for some dialects but not for all. Practically, it means that select * from analyse should parse in Oracle, but probably not in MySQL (taking as a principle that people won't type this statement in MySQL, as it would not compile).
        In such a case, a grammar like this works:
      PARSER_PACKAGE=parser.select;
      PARSER_CLASS=SelectParser;
      IGNORE_CASE = true;
      
      SKIP : " " | "\t" | "\n" | "\r" ;
      
      SPECIAL_TOKEN :
         < LINE_COMMENT: ("--" | "//") (~["\r","\n"])*>
      |  < MULTI_LINE_COMMENT: "/*" (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/"> ;
      
      TOKEN : 
           <SELECT : "SELECT">
         | <TIMES : "*">
         | <FROM : "FROM">
         | <ANALYSE : "ANALYSE" | "ANALYZE" >
         | <PLAINID: ["a"-"z","A"-"Z"] ( ["a"-"z","A"-"Z","0"-"9"] )* >;
      
      Id :
        <ANALYSE> | <SELECT> | <PLAINID>;
      
      SelectStmt :
        <SELECT> <TIMES> <FROM> Id();
      
      AnalyseStmt :
        <ANALYSE> Id();
      
      Stmts :
          SelectStmt
        | AnalyseStmt;
      
      Root :
        (Stmts)+ <EOF>;

      Indeed, the following statements parse well:

      select * from table1   -- normal. works everywhere
      select * from analyse  -- using kw as id. works in Oracle, but not in MySQL
      select * from select   -- using kw as id. would never be typed. used for demonstration
      analyse table2         -- normal. works in MySQL, but not in Oracle
      analyse analyse        -- using kw as id. would never be typed
      analyse select         -- using kw as id. would never be typed. used for demonstration

      It probably then only requires tagging tokens with vendor information, for syntax-coloring or auto-completion purposes...
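
      Something along these lines, maybe, using the INJECT facility to hang the vendor info off the generated Token class (the vendors field and its accessors are pure invention on my part):

      INJECT Token :
      {
          // Hypothetical: the set of dialects for which this token is a keyword;
          // null for tokens that are keywords nowhere (plain identifiers, literals...).
          private java.util.Set<String> vendors;
          public java.util.Set<String> getVendors() { return vendors; }
          public void setVendors(java.util.Set<String> vendors) { this.vendors = vendors; }
      }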

      • However, for the grammar problem itself, the problem is vast. Take the Postgres select definition and compare it with the same thing for MySQL. You will quickly see that the same skeleton is decorated with many vendor ornaments. So parsing everything with a single grammar is mission impossible, in my opinion.
        However, how could one parse a sub-syntax by skipping any token until some recognized pattern?
        For example, let's compare the Postgres and MySQL beginnings of the select statement, and say I want to parse only the common yellow portion while skipping the specs in the red boxes:

      [Image: the Postgres and MySQL select-statement syntaxes side by side; the common portion is highlighted in yellow, the vendor-specific specs are in red boxes (http://www.sqlbrowser.com/discuss_image/Image629.png)]

      What could one use to express

      <SELECT> ( to_skip() )* select_expression()

      Where to_skip() would be basically any token, as long as it does not match a select_expression().
      In regex, it's common to skip anything not matching something. However, I never tried to express such a construct in JavaCC. It's a sort of "look ahead until you see select_expression(), and perhaps chain the skipped tokens as a comment chain, or add them linearly to the parse tree"?

      Any ideas?
      Thanks!

      Example:
      select HIGH_PRIORITY bla from table1
      This is a valid statement in MySQL, as HIGH_PRIORITY is a MySQL adornment.

      But in other dialects, HIGH_PRIORITY is a legit Id(), so that
      `select HIGH_PRIORITY from table1`
      is valid.

      The difficulty arises if we try to summarize the select production as

      SelectStmt :
        <SELECT>
        Id*
        ( <TIMES> | Expression )
        <FROM> Id;

      As Id is also part of Expression! We need some backtracking here!

        Sorry for spamming, but more generally, the problem I have is about approximative parsing, or rather 'chunk parsing', i.e. parsing only certain constructs while discarding non-essential tokens...

        What I mean is that we would like to parse phrases like:
        select <bla> <di> <bla> expression()
        i.e. after having encountered the select, try to match the next expression, skipping any tokens on the way...
        The problem is that the tokens to be ignored are also part of the expression production, hence the ambiguity.

        Canonically, it boils down to parsing:
        (id)* expression
        where expression can have a leading id

        I am not sure how to create a semantic lookahead to do this

          ngx

          Okay, as regards this case:

          select HIGH_PRIORITY bla from table1

          one way to deal with this situation with a lookahead would be to write a production that is specifically only used in lookahead, i.e. something like:

            SelectHighPriorityLA#scan :
                 <SELECT> <IDENTIFIER> 
                 ASSERT {currentLookaheadToken.getImage().equalsIgnoreCase("HIGH_PRIORITY")}
                 <IDENTIFIER>
            ;

          (Note, BTW, that the #scan annotation means the production is only used in a lookahead, so no regular parsing routine or skeletal node class is generated.)

          And then this can be used in a lookahead like:

                    SCAN SelectHighPriorityLA
                    => <SELECT> ACTIVATE_TOKENS HIGH_PRIORITY (<HIGH_PRIORITY>)
                          <IDENTIFIER> <FROM> <IDENTIFIER>
                    |
                    ... (other choices)

          So, you see, the lookahead SelectHighPriorityLA actually scans ahead, and if the token after <SELECT> is not "HIGH_PRIORITY", the lookahead fails and we go on to the next choice.

          You can see this being used in the current Java grammar here: https://github.com/javacc21/javacc21/blob/master/examples/java/Java.javacc#L174

          Since "record" is a soft keyword, i.e. it is a keyword in this context of a type declaration but elsewhere can just be a regular identifier, we have this trick, which is basically the same thing as what I propose above. Note that this is a little tricky because the lookahead is scanning the string "record" as an identifier but then when it actually goes into the RecordDeclaration production (see line 276 of the same file) it actually rescans the "record" as a <RECORD> token.

          Note also that for this to work, we need what is on line 48 of that file, i.e.

            DEACTIVATE_TOKENS=RECORD, VAR, YIELD, SEALED, NON_SEALED, PERMITS;

          The various soft tokens are de-activated by default and only activated in the spots where they are really keywords, not just identifiers.

          Actually, another example that I think worthy of study would be how certain roughly similar things are dealt with in the C# grammar I wrote recently. See here: https://github.com/javacc21/javacc21/blob/master/examples/csharp/CSharp.javacc#L1017
          and: https://github.com/javacc21/javacc21/blob/master/examples/csharp/CSharp.javacc#L859

          So, you see, there is this machinery to deal with context-sensitive tokenization. It is still a bit tricky, because you have to be cognizant of when a token type is activated or not. So you can write a lookahead that is based on the token type not being active, but then, when you actually go and parse, ACTIVATE_TOKENS rolls back any cached tokens from the lookahead and rescans, so what ends up in the AST is the appropriate (soft) token type, not just an identifier.


          ngx

          Let's see... Offhand, I guess what you need to do is consume tokens until you can actually parse an expression, so it might be something like:

             (
                    SCAN ~(Expression)
                    => LeadingToken
             )*

          Though, that said, my intuition is that you would end up defining a separate production solely for the Expression lookahead, so you might end up with:

               (  SCAN ~(ExpressionLA) => LeadingToken )*

          But the idea would be that if the input that follows can't be parsed as an expression, we just consume the token that follows, which I assume would be one of maybe several types.

          Actually, it occurs to me that you might need a sort of compound lookahead, because you also want to check whether the next token is one of the appropriate leading token types, no? So you'd have maybe something like:

             (
                    SCAN {getToken(1).getType() == SOME_TYPE || getToken(1).getType() == SOME_OTHER_TYPE}
                    ExpressionLA
                    => (<SOME_TYPE> | <SOME_OTHER_TYPE>)
             )*

          4 days later

          Hi

          Thanks again for your multiple responses.
          Let me restate my ubiquitous-grammar problem in two points:
          1) Reserved words vary across dialects: a reserved word in Dialect1 can be a plain ID in Dialect2.
          2) Extra syntax (shown in the example below as Id*) can exist in certain dialects around the same skeletal grammar.

          This seems to do the trick!

          SelectStmt :
            <SELECT>
            ( SCAN ~FollowSelectLA => Id )*
            ( <TIMES> | Exprs )
            <FROM> Id;
          
          FollowSelectLA#scan :
            ( <TIMES> | Exprs )
            <FROM> Id;

          select B1 SELECT B2 from b5
          ==>

          Dumping the AST...
          Root
            SelectStmt
              select
              Id
                B1
              Id
                SELECT
              Exprs
                Expr
                  Id
                    B2
              from
              Id
                b5
            EOF

          Thanks again!
          Enjoy the Alhambra!

          A little question still, if I may:

          in

          SelectStmt :
            <SELECT>
            ( SCAN ~FollowSelectLA => Id )*
            ( <TIMES> | Exprs )
            <FROM> Id;
          
          FollowSelectLA#scan :
            ( <TIMES> | Exprs )
            <FROM> Id;

          There is a cut/paste syndrome. Any way to avoid that, without writing a real production for FollowSelect, which is indeed only there for LA purposes?

          Hi

          Sorry again for bombarding you, but I have a general question, for which you may have an answer:
          Let's imagine that I have a soup of tokens (quarks!) which are deemed uninteresting, but in which we sometimes find a sequence that we actually want to parse. It's a bit like saying that, in a Java parser, we are only interested in expressions, discarding all the rest... (This idea of partial parsing relates a bit to fault-tolerant parsers, maybe.)

          You already hinted well on the negative SCAN approach, but I wonder whether it may not be efficient enough.

          I don't think I can use token activations in this context, as the token soup is full of legit tokens!

          Do you think that a manual scan for leading keywords is a good approach? (Which, in the case of expressions, is a bit dull BTW, as many things can start an expression.)

          Thanks!

            Hello Again!

            [Until you tell me to shut up, I will use this channel! ;-) ]
            So I am now able to skip undesired tokens, like this:

            Filler :
              ( SCAN ~(ValidStmt) => (Anything) )*;
            
            ChainedStmt :
              Filler
              ValidStmt;
            
            ValidStmt :
                SelectStmt;
            
            Root :
              (ChainedStmt)+ <EOF>;

            This parses wonderfully:

            bla + "12" insert select
            select distinct HIGH_PRIORITY "a", 'a', [a], B2, B3, B4/B5+(B6) from b5

            So any 'uninteresting' portion happening before a valid stmt goes into a Filler node, which is perfect... All is good:

            Dumping the AST...

            Root
              ChainedStmt
                Filler
                  bla PLAINID null
                  + _PLUS null
                  "12" STRING_LITERAL null
                  insert INSERT [SQL2016, MySQL, Oracle, SQLServer]
                  select SELECT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                SelectStmt
                  select SELECT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  distinct DISTINCT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  HIGH_PRIORITY HIGH_PRIORITY [MySQL]
                  Expressions
                    LiteralExpression
                      "a" STRING_LITERAL null
                    , _COMMA null
                    LiteralExpression
                      'a' STRING_LITERAL null
                    , _COMMA null
                    LiteralExpression
                      [a] STRING_LITERAL null
                    , _COMMA null
                    B2 PLAINID null
                    , _COMMA null
                    B3 PLAINID null
                    , _COMMA null
                    AdditiveExpression
                      MultiplicativeExpression
                        B4 PLAINID null
                        / _SLASH null
                        B5 PLAINID null
                      + _PLUS null
                      Parentheses
                        ( _LPAR null
                        B6 PLAINID null
                        ) _RPAR null
                  from FROM [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  b5 PLAINID null
              EOF null

            Except that an uninteresting portion can also happen at the end, before the <EOF>, as in

            bla + "12" insert select
            select distinct HIGH_PRIORITY "a", 'a', [a], B2, B3, B4/B5+(B6) from b5
            another uninteresting portion "hi" 12
            <EOF>

            No matter how I try, e.g.

            Root :
              (ChainedStmt)+ (Anything)* <EOF>;

            The parser invariably goes into ChainedStmt and fails with a syntax error on seeing <EOF>.

            I tried many things, like including <EOF> in various productions, but this seems a bad idea.
            I also tried many other things. I am in FAULT_TOLERANT mode, BTW.
            Is it the negative lookahead that does not know how to back out?
            Any ideas?

            Thanks!

            Oops. I was in fault-tolerant mode but missed the resync point.
            Adding the exclamation point before <EOF> actually does the job, although I am still not sure I get why it does not bail out when we don't resort to the fault-tolerant mechanism...

            Root :
              (ChainedStmt)+ ! <EOF>;

            Text to parse:

            bla + "12" insert select
            select distinct HIGH_PRIORITY "a", 'a', [a], B2, B3, B4/B5+(B6) from b5
            bla + "12" insert select

            Output:

            Dumping the AST...
            Root
              ChainedStmt
                Filler
                  bla PLAINID null
                  + _PLUS null
                  "12" STRING_LITERAL null
                  insert INSERT [SQL2016, MySQL, Oracle, SQLServer]
                  select SELECT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                SelectStmt
                  select SELECT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  distinct DISTINCT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  HIGH_PRIORITY HIGH_PRIORITY [MySQL]
                  Expressions
                    LiteralExpression
                      "a" STRING_LITERAL null
                    , _COMMA null
                    LiteralExpression
                      'a' STRING_LITERAL null
                    , _COMMA null
                    LiteralExpression
                      [a] STRING_LITERAL null
                    , _COMMA null
                    B2 PLAINID null
                    , _COMMA null
                    B3 PLAINID null
                    , _COMMA null
                    AdditiveExpression
                      MultiplicativeExpression
                        B4 PLAINID null
                        / _SLASH null
                        B5 PLAINID null
                      + _PLUS null
                      Parentheses
                        ( _LPAR null
                        B6 PLAINID null
                        ) _RPAR null
                  from FROM [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  b5 PLAINID null
              ChainedStmt (incomplete)
                Filler (incomplete)
                  bla PLAINID null
                  + _PLUS null
                  "12" STRING_LITERAL null
                  insert INSERT [SQL2016, MySQL, Oracle, SQLServer]
                  select SELECT [SQL2016, MySQL, Oracle, Postgres, SQLServer]
                  Anything (incomplete)
              EOF null

              ngx You already hinted well on the negative SCAN approach, but I wonder whether it may not be efficient enough.

              Well, the only way to know whether the approach is efficient enough would be to try it. My intuition is that, most likely, you'll be quite pleasantly surprised. Just thinking about this (though obviously you need to try it): if the lookahead usually fails, and it fails in the first... I dunno... two or three tokens, it might not be very expensive at all -- especially if the tokens you're scanning ahead over are cached.

              The thing about activating/deactivating token types, or just switching lexical states, is that one ends up throwing away any cached tokens, so the machinery has to retokenize, and that could be a wee bit costly. But if you're not resetting the cached tokens, then, just thinking about this offhand, it's probably all pretty cheap.

              And, that said, even retokenizing the input is not terribly expensive! As things stand, the generated parsing/lexing code is pretty efficient. Not that we are getting that much feedback (I'd like to get more, and I'm pretty happy to be "bombarded", as you say), but there is some feedback, and I can't recall anybody saying they have a big performance problem. OTOH, with the main competing tool, ANTLR, it's entirely different. You can see just how much buzz there is about performance issues. In fact, that's probably why the old legacy JavaCC is still fairly widely used after so many years of abandonment. People try to switch to ANTLR and the performance hit is just terrible.


                ngx Root :
                (ChainedStmt)+ ! <EOF>;

                Well, yeah, first of all, I guess you figured out that, unless you put in some resync points with the !, fault-tolerant doesn't do anything! (Well, hold on, actually it does, because the various one-off token-skipping tricks are in operation, but that's separate from the resync-point concept... so if you don't define any resync points...)

                Now, I guess you've probably also figured out that lookahead and fault-tolerant are basically orthogonal. I mean, fault-tolerant behavior only applies when you're in regular parsing, not when you're in a lookahead. So, the line above basically means that if we hit a problem when parsing a ChainedStmt construct, we're going to go into some recovery routine that tries to get back on the rails, either by finding the start of a new ChainedStmt or an EOF (since that is what the following construct is). So, in this case, there could be a megabyte of unparseable goo between the last ChainedStmt and the end of the file, and it just gets discarded, basically. Well, turned into an InvalidNode object, right? (It's been so long since I actually looked at this that I have to think back...)

                I completed the coding of this whole fault-tolerant thing over a year ago, but it's still not very mature. I finished it off but sort of left it in an unpolished state, because I then turned to rewriting the tokenizer part; I meant to get back to really whipping it into shape and never did.

                Well, in short, I'm very interested in feedback from people trying to use it. There isn't really a fleshed-out API for identifying the problematic parts of the AST, even though that is not really very hard to do. Or, another way to look at this is that the advantage of the current state of affairs is that you have a chance to have some serious input into how the final version works.

                revusky People try to switch to ANTLR and the performance hit is just terrible.

                I tried it once. It was like 300 times slower on my benchmarks. Plus, it was not treating case insensitivity correctly (or wasn't; this was years ago). So I ditched it immediately!