My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Share Next Entry
I wrote a new tool that runs aggregation queries over MySQL sharded databases using Gearman.
swanhart
I created a new tool this week:
http://code.google.com/p/shard-query

As the name Shard-Query suggests, the goal of the tool is to run a query over multiple shards, and to return the combined results together as a unified query. It uses Gearman to ask each server for a set of rows and then runs the query over the combined set. This isn't a new idea, however, Shard-Query is different than other Gearman examples I've seen, because it supports aggregation.

It does this by doing some basic query rewriting based on the input query.

Take this query for example:
select c2, 
       sum(s0.c1), 
       max(c1) 
 from t1 as s0 
 join t1 using (c1,c2) 
 where c2 = 98818 
 group by c2;


The tool will split this up into two queries.

This first query will be sent to each shard. Notice that any aggregations, grouping or sorting has been stripped away. Any joins or where clauses are still present:
select   c2 as `c2` ,   
         (s0.c1) as `(s0.c1)` ,   
         (c1) as `(c1)`   
  from   t1 s0   
  join   t1   
 using   (c1,c2)   
 where   c2   =   98818;


Shard-Query supports two different methods of coalescing the results from each shard. The first is called 'fetch' which as it sounds, fetches the results through an array, and inserts them into a temporary table. This is less desirable than the default method, which is called store. When the store method is used, each worker uses bulk insert to insert rows directly into a destination table. The downside is that the table has to be a physical table, since it must be inserted into from multiple connections. The store method is also better than fetch because the rows are inserted into the table in parallel from each shard. The insertion is serialized with the fetch method, since the results are handled by a single threaded PHP script.

Either a temporary table, or a concrete table will be created to hold the rows from all the shards depending on the fetch method.
CREATE TEMPORARY TABLE -- not a temporary table when the method is store
         IF NOT EXISTS `aggregation_tmp_#14861474` 
(        `c2` BINARY(255), 
         `(s0.c1)` BINARY(255), 
         `(c1)` BINARY(255)
) ENGINE=INNODB;


Finally, after all the rows are inserted into that table, the final rewritten query is executed. Notice that the SELECT clause has been rewritten to accommodate the column names of the temporary table. Any FROM clause has been removed and replaced with a scan of the single table. Also notice the WHERE clause is not present, because the rows were filtered at each shard:
select (`c2`), 
       sum(`(s0.c1)`), 
       max(`(c1)`)  
  from `aggregation_tmp_#14861474`                 
 group by `c2`;


Finally, the result is output and then the table is dropped.
c2      sum(`(s0.c1)`)  max(`(c1)`)
-       -               -
98818   16              8


I hope you find it useful. Right now it works with MySQL using the plain old mysql_ class of functions, but I can abstract the database access out if somebody really wants me to. This could potentially be used for just about any database which supports SQL.

You are viewing swanhart