My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Share Next Entry
8 substitution rules for running SQL in parallel
swanhart
A lot of the work that I do on my personal projects such as Flexviews and Shard-Query involves taking a query and manipulating it in such a way as to obtain the same result, but in a different manner.  Flexviews amortizes aggregation over time, making it possible to maintain summary tables incrementally. Shard-Query (http://code.google.com/p/shard-query) spreads queries over multiple Gearman workers, allowing it to run queries on one or more MySQL servers in parallel.

Both of these tools rely on a similar set of substitution rules.  One of my favorite parts of various math classes was learning how the different mathematical expressions could be substituted for one another, allowing you to solve what otherwise appeared to be impossible problems.  Using simple substitutions, SQL can be made to solve difficult problems too, such as running queries in parallel or on multiple shards (or both).

I've tried to collect a some of the rules that I use in my tools.  Most of these rules aren't very useful for your regular SQL applications, they are meant to be applied when you build a parallel query tool like Shard-Query.

Rule #1 - Algebraic rules apply to SQL expressions
This rule is important for query optimization.  The MySQL optimizer isn't very smart about optimizing expressions so that they can use indexes. 
For example:
SELECT * FROM t WHERE some_col + interval 10 minute >= now()
 
Can be changed to:
SELECT * FROM t WHERE some_col >= now() - interval 10 minute;

Okay, that was a general rule of thumb, but pretty much everything else is related to parallel querying.

Rule #2 - An IN LIST is equivalent to a UNION ALL (as long as the are no duplicate items in the IN list)
For example:
SELECT * FROM t WHERE some_col IN (1,2,3)
Can be changed to:
SELECT * FROM t WHERE some_col = 1
UNION ALL
SELECT * FROM t WHERE some_col = 2
UNION ALL
SELECT * FROM t WHERE some_col = 3

Rule #3 - COUNT(*) = SUM(1)
For example:
SELECT COUNT(*) FROM t 
Can be changed to:
SELECT SUM(1) as `COUNT(*)` FROM t

Rule #3 is only useful in a very limited context, like Flexviews.

Rule #4 - AVG(expr) = SUM(expr)/COUNT(expr)
For example:
SELECT AVG(some_col) FROM t 
Can be changed to:
SELECT SUM(some_col)/COUNT(some_col) as `AVG(some_col)` FROM t

Of course, you knew that AVG = SUM/COUNT, but it is important to remember that this decomposition can be made to split up the work.

Rule #5 - In some cases, BETWEEN can be expressed as an IN LIST (for DATE and INTEGER type columns) This conversion to IN makes range lookups on a hash partitioned column possible, as long as the column is INTEGER or DATE.
For example:
SELECT * FROM t  where some_col between 1 and 3
Can be changed to:
SELECT * FROM t where some_col IN (1,2,3)
Note that once it is converted to an IN list, then we might be able to further convert the IN to UNION ALL (see rule 2 and #6)

Rule #6 - It is possible to run each part of a UNION ALL in parallel if you use read-committed transaction isolation level and the query uses no aggregation
The example given in rule #2 is perfect for this.  Each portion of the UNION ALL is independent and can be run in parallel without any further modification.

Rule #7 - It is possible to rewrite aggregate queries for parallel work for MIN/MAX/AVG/SUM/COUNT, but you need the additional following rewrite rules to "wrap" the UNION ALL:
COUNT(expr) := SUM(`COUNT(expr)`)
AVG(expr) := SUM(`SUM(expr)`) / SUM(`COUNT(expr)`) # see below and rule #4
For example:
SELECT AVG(some_col) as expr FROM t WHERE col2 in (1,2,3)
 
Can be changed to:
SELECT SUM(the_union.expr_s) / SUM(the_union.expr_c) as expr
FROM (
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 1
UNION ALL
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 2
UNION ALL
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 3
) as the_union

Rule #8 - It is possible to rewrite aggregate queries for parallel work for STDDEV,VARIANCE,ETC, but you need the additional following rewrite rules to defer the aggregation until the final step.
STDDEV(expr) := SELECT ... ,expr, ... GROUP BY expr

For example:
SELECT STDDEV(some_col) as expr FROM t WHERE col2 in (1,2,3)
Can be changed to:
SELECT STDDEV(the_union.expr) as expr
FROM (
SELECT some_col as expr FROM t WHERE col2 = 1 GROUP BY some_col
UNION ALL
SELECT some_col as expr FROM t WHERE col2 = 2 GROUP BY some_col
UNION ALL
SELECT some_col as expr FROM t WHERE col2 = 3 GROUP BY some_col
) as the_union

You are viewing swanhart