My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Share Next Entry
A handy regular expression for 'tokenizing' a SQL statement
Often times I find myself having to handle a particular portion of a SQL statement via a script. I've written a lot of specialized regular expressions over time to handle these tasks, but the one that I've had to write the most is a basic 'tokenizer' which understands the quoting semantics of a MySQL statement. That is, it understands `BackTick`, 'Single Quote' and "Double Quoted" strings.

#PHP Heredoc syntax
$regex = <<< END_OF_REGEX
  [^ \"'`(),]*\([^)]*\)    #match functions like concat(x,"y",`a`.`z`) or sum(`xyz`);
  |\([^)]*?\)              #match grouped items
  |"[^"]*?"                #match double quoted items
  |'[^']*?'                #match single quoted items
  |`[A-Za-z0-9_ .'\"()+\\-&^%\$+?%\\/\\\\!`]+`  #match backtick mysql names
  |[^ ,]+                  #match keywords, operators and aliases

EDIT: After some comments and further testing, I've modified the regex to look like the following:
$regex = <<<'END_OF_REGEX'
  [A-Za-z_.]+\(.*?\)+   # Match FUNCTION(...)
  |\(.*?\)+             # Match grouped items
  |"(?:[^"]|\"|"")*"+   # Match double quotes
  |'[^'](?:|\'|'')*'+   #   and single quotes
  |`(?:[^`]|``)*`+      #   and backticks
  |[^ ,]+

This will properly match escaped values within quoted statements. Functions with nested function calls or subselects may not end up being fully grouped together. If anybody has any idea how I might fix that, well, feel free to chime in :D

When used with the preg_match_all() function, an array is produced which represents the tokenized string. Functions calls, quoted strings and grouped expressions are returned together, and may need additional processing, depending on your needs.

For example:
-- select sum(a * b) sweety,'abc' as a1, 
--        "abc",concat(`def`.`abc`,'hello', x.y, z) as `blah`,
--        null + 1 
--   from `abc`.`def` as def1 
--   join xyz.zzz z1 
--     on (z1.a = def1.a) 

    [0] => select
    [1] => sum(a * b)
    [2] => sweety
    [3] => ,
    [4] => 'abc'
    [5] => as
    [6] => a1
    [7] => ,
    [8] => "abc"
    [9] => ,
    [10] => concat(`def`.`abc`,'hello', x.y, z)
    [11] => as
    [12] => `blah`
    [13] => ,
    [14] => null
    [15] => +
    [16] => 1
    [17] => from
    [18] => `abc`.`def`
    [19] => as
    [20] => def1
    [21] => join
    [22] => xyz.zzz
    [23] => z1
    [24] => on
    [25] => (z1.a = def1.a)

I'm using this to process SQL statements and automatically create the Flexviews calls for the query.
The following output is generated by my program for the above SQL statement (only the SELECT clause is currently represented):

call flexviews.add_expr(@mvid, 'SUM', 'a * b', 'sweety');
call flexviews.add_expr(@mvid, 'GROUP', 'abc', '1');
call flexviews.add_expr(@mvid, 'GROUP', "abc", '_abc_');
call flexviews.add_expr(@mvid, 'GROUP', 'concat(`def`.`abc`,\'hello\', x.y, z)', 'blah');
call flexviews.add_expr(@mvid, 'GROUP', 'null + 1', 'null_+_1');

  • 1

Quoted strings

the regex for matching single and double quoted strings doesn't handle escaped quotes inside the string, try something like:


What do you mean by mysql backtick names? Where is this in the documentation?

It is in the section on identifiers:

Unless you have the SQL_MODE=ANSI_QUOTES, then you must use backticks to quote identifiers that contain special characters like space.

For example:
create table `x yz` (c1 int) engine=innodb;

Instead of this,

|`[A-Za-z0-9_ .'\"()+\\-&^%\$+?%\\/\\\\!`]+`

I wrote it like this...


And it seemed to work just fine.
Do you see anything wrong with this?

You can have an escaped quote character in the middle of the identifier, that is: `this is`` a valid identifier`.

I think this regex should work better and I'll edit the post in a moment:

This matches one non-backtick character optionally followed by either zero or more non-backtick characters, or exactly two backtick characters. This pattern will repeat one or more times. The ?: modifier turns off capturing for the repeating group.

#match backtick mysql names

From MySQL 5.1 all Unicode characters (with exception of the NULL character/byte) can be used for identifiers. Even non-printable characters and characters outside the unicode 'basic multilingual plane' can. So I doubt that your expression for "#match backtick mysql names" is correct with recent servers! What say?

Re: #match backtick mysql names

Wait, so I could end up with different table and column names if I change the value of 'client_character_set'?

The regex probably doesn't work with multi-byte character sets anyway. I only have to process fairly simple queries like:

SELECT column as alias, AGG_FUNC(x,y)
JOIN .. ON ..

I need to fix the expression to support nested function calls and subselects, both which currently don't work but I don't need to worry about supporting them for the moment.

It seems to me that if you need to parse complex structures like subqueries that a more robust lexer/parser should be used, not that there seems to be a good choice for that in the PHP world.

I think this expression

|`[A-Za-z0-9_ .\'\"\(\)+\\-\&\^%\$+?\%\\/\\\\!`]+` # match backtick mysql names

can simply be replaced by


What do you think?

select \`abc\` from dual :)


A best regex might be :
  [A-Za-z_][A-Za-z_.0-9]*\(.*?\)+   # Match FUNCTION(...) (function like MD5)
  |\(.*?\)+             # Match grouped items
  |"(?:[^"]|\"|"")*"+   # Match double quotes
  |'[^'](?:|\'|'')*'+   #   and single quotes
  |`(?:[^`]|``)*`+      #   and backticks
  |-                    # minus operator
  |=                    # equal operator
  |[0-9]+               # number
  |[^ \n,]+            

  • 1

Log in