My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Add to Memories Share Next Entry
Intra-query parallelism for MySQL queries without an appliance or closed source database
swanhart
*edit* I want to point out that this test was done on a single database server which used MySQL partitioning. This is a demonstration of how Shard-Query can improve performance in non-sharded databases too.*edit*.

Over the weekend I spent a lot of time improving my new Shard-Query tool (code.google.com/p/shard-query) and the improvements can equate to big performance gains on partitioned data sets versus executing the query directly on MySQL.


I'll explain this graph below, but lower is better (response time) and Shard-Query is the red line.

MySQL understands that queries which access data in only certain partitions don't have to read the rest of the table. This partition elimination works well, but MySQL left a big optimization out of partitioning: getting data in parallel.

In fact, since partition elimination is the only major optimization provided by the partition options it isn't great for scaling access to large data sets when the entire data set must be accessed, but only when smaller parts of a the set are examined.

Since Shard-Query exploits parallelism with Gearman (http://www.gearman.org) I decided to extend the Shard-Query "optimizer" to support running queries with IN lists in parallel. This makes a query scale much further than it would if there was no parallelism at work.

Consider the table following partitioned fact table:
CREATE TABLE `fact` (
  `id` bigint(20) unsigned DEFAULT NULL,
  `a_id` bigint(20) unsigned DEFAULT NULL,
  `b_id` int(11) NOT NULL,
  `c_id` int(11) NOT NULL,
  `i1` tinyint(4) DEFAULT NULL,
  `qty` smallint(6) DEFAULT NULL,
  `score` decimal(10,10) DEFAULT NULL,
  `price` decimal(7,3) DEFAULT NULL,
  `i2` int(11) DEFAULT NULL,
  `i3` int(11) DEFAULT NULL,
  `wide_row` char(54) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (i1) PARTITIONS 100 */

The table is partitioned into 100 partitions, and there are 100 distinct values for i1. This means that all the values for a particular i1 are housed in a single partition.

Consider the following query:
select price*qty from fact where i1 in (1,2,3);


This query is semantically equivalent to:
select price*qty from fact where i1 = 1
UNION ALL
select price*qty from fact where i1 = 2
UNION ALL
select price*qty from fact where i1 = 3


Unfortunately, MySQL does not have any intra-query parallelism, so rewriting the query that way is not an effective scaling strategy. However, if you execute all three queries at the same time, and use a temporary table as the UNION ALL, you can actually get parallelism. This is what Shard-Query does. It can take each IN list item and assign it to a worker, and stuff the results back together at the end.

The second thing that can be done to improve performance at the partition level (or the shard level) is to push down aggregation of distributable aggregate functions to the worker. If you've read my blog before you might know that the distributable aggregate functions are SUM and COUNT.

Consider a query very much like the previous query:
select shard_col, sum(price * qty) from t1 where shard_col in (1,2,3) group by shard_col;


This query features aggregation with distributable functions. This query is semantically equivalent to the following:

This query is semantically equivalent to:
SELCT shard_col, SUM(`sum(price*qty)`) as `sum(price*qty)` 
from ( select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
       UNION ALL
       select shard_col, sum(price*qty) from t1 where shard_col = 2 group by shard_col
       UNION ALL
       select shard_col, sum(price*qty) from t1 where shard_col = 3 group by shard_col
) GROUP BY shard_col;


Shard-Query can push the aggregation down to the shards and it sends each query which is part of the "UNION ALL" operation in parallel.

I ran a benchmark based on the above table with 40M rows in it. That is 400K rows per shard.
The benchmark queries follow the pattern:
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3,4) group by i1;
...

All the way up to 100 values in the IN list. The table contains 400K rows for each i1 value.

Here is that graph again. For this test I used an EC2 "c1.large" instance. That is, 8 cores with 8 GB of memory. I used a 4GB data set, Percona Server 10.2 and a 1GB buffer pool size. Each partition is approximately 32MB in size.

Since the machine has eight cores, I started eight Gearman workers. The results are, I think, impressive. In the graph, "partitions scanned" is the number of values in the IN list.



This performs very well due to the pushdown of the aggregation. If a non-distributable aggregate function were to be used, then performance would probably be worse than MySQL because all 40M rows would be accessed and copied into a temporary table.

However, it can add a lot of parallelism to queries which scan multiple partitions but return fewer rows, that is, queries with a more restrictive where clause.

This is the simple mysql benchmark script:
[root@localhost benchmarks]# cat inlist_test.php 

$conn = mysql_connect() or die(mysql_error());
$fh = fopen('inlist.sql', 'r');
while($line = fgets($fh)) {
        $start=microtime(true);
        $stmt = mysql_query($line, $conn);
        while($row = mysql_fetch_assoc($stmt)) {
        }
        mysql_free_result($stmt);
        echo "WALLTIME::" . (microtime(true) - $start) . "s\n";
}

[root@localhost benchmarks]# head inlist.sql -n4
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3,4) group by i1;

The Shard query times were captured with:
cat inlist.sql | ./shard_query.php --ini=parallel_query.ini --one --inlist=* --pushdown|grep WALL

"SELCT shard_col, SUM(`sum(price*qty)`) as `sum(price*qty)`
from ( select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
) GROUP BY shard_col;"

should be
"SELCT shard_col, SUM(`sum(price*qty)`) as `sum(price*qty)`
from ( select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 2 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 3 group by shard_col
) GROUP BY shard_col;"

right ?

Have you considered making this handle multiple queries? We have a Gearman worker that takes multiple queries and runs them in parallel. It would seem to be easy to simply take multiple queries to start with and break them up into even more queries.

Re: Multiple Queries?

swanhart

2010-05-25 04:24 pm (UTC)

Brian:

All you have to do is move main() into a gearman function and make it accept a commandline argument which is the SQL query, instead of reading from a file and executing serially.

Then you can spit off as many parallel queries as you want in your application. You'll need to make main() iterate and return the results.

I need to make a roadmap page. Eventually I want the whole program to be runnable as a gearman task, so that MySQL-proxy or any other tool can kick off parallel-parallel queries. Of course, it can only do as much work as you have workers.

Edited at 2010-05-25 04:25 pm (UTC)

Thanks for this wonderful post.Admiring the time and effort you put into your blog and detailed information you offer..

Thanks for the feedback.

I've made a lot of improvements to Shard-Query since this post.
http://code.google.com/p/shard-query

Looking forward to get involved

(Anonymous)

2011-04-12 05:15 pm (UTC)

Hey - I am really happy to find this. great job!

How to Implement Shard Query?

(Anonymous)

2011-06-29 10:29 am (UTC)

Hello Justin,

Having read the wonders of what shard query can do, I am very keen to implement it on my server. However, there isn't much documentation, even on google code webpage http://code.google.com/p/shard-query/. I have downloaded all the files from trunk http://shard-query.googlecode.com/svn/trunk/ and placed those files into my respective html folder and include folder. I have installed Gearman on my computer.

Next, when I open the run_query.php file using my web browser on localhost, nothing happens. I have set ini_set('display_errors','1');
ini_set('display_startup_errors','1');
error_reporting (E_ALL);
with no further errors coming out. What do I need to do to integrate it to run queries on my website on localhost?

Many Thanks.
~Beginner

Re: How to Implement Shard Query?

swanhart

2011-06-29 07:13 pm (UTC)

run_query reads queries from standard input.

echo "select a as a, count(*) cnt from some_table group by 1;" | php run_query.php --ini=myconfig.ini

Re: How to Implement Shard Query?

(Anonymous)

2011-06-30 09:15 am (UTC)

Hi Justin,

Instead of running on command line, how do I invoke shard query on my index.php script which has the following statement:

$sql = "select a as a, count(*) cnt from some_table group by 1";
$result = mysqli_query($cxn, $sql)
or die("couldn't execute query");

I understand that there is a main() function in run_query.php. Problem is how do I make it run and take $sql query as input when I open the page on a web browser?

Sorry that I am just beginning to learn OOP. Have not fully understood the inner workings yet.

Thanks for your time :)
~Beginner

Re: How to Implement Shard Query?

swanhart

2011-06-30 06:42 pm (UTC)

You will need to look at run_query.php as it is the example which shows how to use the ShardQuery->query() method to run queries.

run_query.php and loader.php are example programs that just happen to also be fully usable examples from the command line.

Re: How to Implement Shard Query?

swanhart

2011-06-30 06:45 pm (UTC)

In short you need:
$SQ = new ShardQuery($shards, $params);
$SQ->query($SQL)

All run_query does is split the input into statements using semicolon (;) as a terminator and then it passes the statement into Shard-Query.

Re: How to Implement Shard Query?

(Anonymous)

2011-07-01 10:36 am (UTC)

Thanks for the fast response. It is clearer to me now. So in short, I will need to create a new object and run the query using it's in-built query function.

By the way, I am trying to get a successful run of run_query.php first before integrating shardquery into my index.php script. Wonder if you can help me on the error below.

# echo "select lo_linenumber as a, count(*) cnt from lineorder group by lo_suppkey;" | php /var/www/html/shard/run_query.php --ini=/var/www/html/shard/shards_scaleup.ini
PHP Fatal error: Uncaught Net_Gearman_Exception: Can't connect to server in /usr/share/pear/Net/Gearman/Client.php on line 83
#0 /usr/share/pear/Net/Gearman/Client.php(83): Net_Gearman_Connection::connect('127.0.0.1:7001', 1000)
#1 /var/www/pizza/shard/shard-query.php(90): Net_Gearman_Client->__construct(Array)
#2 /var/www/html/shard/run_query.php(95): ShardQuery->query('select lo_linen...')
#3 /var/www/html/shard/run_query.php(38): main()
#4 {main}
thrown in /usr/share/pear/Net/Gearman/Connection.php on line 160

Thanks for your time.
~Beginner

Re: How to Implement Shard Query?

swanhart

2011-07-01 06:20 pm (UTC)


Re: How to Implement Shard Query?

swanhart

2011-07-01 06:22 pm (UTC)

You need to run copies of worker.php, and you need to run a gearman server. This is how shard-query works in parallel using gearmand.

Re: How to Implement Shard Query?

(Anonymous)

2011-07-03 10:37 am (UTC)

Hi Justin,

There seems to be many things to install on http://gearman.org/index.php?id=download. I have went through some of the installation procedures. I am not sure what I have missed out and whether my installation is successful. How do I know if gearman server is running in the background? I am a bit confused here. Please help me out.

Thanks a lot!

Re: How to Implement Shard Query?

swanhart

2011-07-03 08:35 pm (UTC)

You have to run gearmand then start workers. Please see the example EC2 images:
http://www.mysqlperformanceblog.com/2011/05/11/shard-query-ec2-images-available/

Re: How to Implement Shard Query?

swanhart

2011-07-03 08:37 pm (UTC)

You can get gearmand via CPAN:

cpan> install Server::Gearmand

#start gearmand
$ gearmand -p 7000 -u nobody -d

#start 8 workers
$ php run_workers 8

Re: How to Implement Shard Query?

(Anonymous)

2011-07-05 05:34 am (UTC)

Hi Justin,

Thanks for your patience. I just found out that the problem is due to a wrong port number specified in shards.ini. The port number should be 7003 instead of 7001.

Now everything works fine. Thanks!

Re: How to Implement Shard Query?

swanhart

2011-07-06 04:05 am (UTC)

Glad to hear you were able to get it to work properly.

Cheers!

Re: How to Implement Shard Query?

(Anonymous)

2011-07-08 01:08 pm (UTC)

Hi Justin,

My apologies that I have to seek your help again regarding this issue.

Previously, I did manage to get the shard_query to run successfully the first try by changing the gearman parameter to 127.0.0.1:7003 in the shards.ini file. However, subsequent attempts to run produces the PHP Fatal error "Uncaught Net_Gearman_Exception: Can't connect to server in /usr/share/pear/Net/Gearman/Client.php on line 83" error.

To get it to run again, I tried to use other ports on the gearman parameter. The only time it does run is when I input gearman = 127.0.0.1:80, with the output as follows:

-- INPUT SQL:
select lo_linenumber as a, count(*) cnt from lineorder group by lo_suppkey;
.
.
.
.
-- AGGREGATION SQL:
SELECT `a`,SUM(`cnt`) AS `cnt`
FROM `aggregation_tmp_75257222` GROUP BY 1
ON DUPLICATE KEY UPDATE
`a`=VALUES(`a`),
`cnt`=`cnt` + VALUES(`cnt`)
run_set: received another set of jobs
run_set: Starting another gearman job
run_set: finished set execution. enters: 0, spins: 0, sleeps: 0, total_sleep_time: 0
run_set: received another set of jobs
run_set: Starting another gearman job
PHP Fatal error: Uncaught Net_Gearman_Exception: Connection was reset in /usr/share/pear/Net/Gearman/Client.php on line 207
#0 /usr/share/pear/Net/Gearman/Client.php(207): Net_Gearman_Connection::read(Resource id #25)
#1 /var/www/pizza/shard/shard-query.php(1032): Net_Gearman_Client->runSet(Object(Net_Gearman_Set))
#2 /var/www/pizza/shard/shard-query.php(147): ShardQuery->run_set(Object(Net_Gearman_Set))
#3 /var/www/html/shard/run_query2.php(34): ShardQuery->query('select lo_linen...')
#4 /var/www/html/shard/run_query2.php(9): main()
#5 {main}
thrown in /usr/share/pear/Net/Gearman/Connection.php on line 253

I was wondering if the host parameter host=127.0.0.1 in the shards.ini file is causing the connection error or whether I need a .so file to be installed somewhere to make it work.

Shard Query runs into error when queries involve dates

(Anonymous)

2011-08-17 01:27 pm (UTC)

I tried to run a query on my database. It didn't work and gives the following error:

PHP Warning: mysql_fetch_assoc() expects parameter 1 to be resource, boolean given in /var/www/include/shard/mysql.php on line 46

However, once I take out the line "AND closingDate >= NOW() ORDER BY postDate DESC", shard query runs and gives me the array I want. I suspect it has something to do with dates.

Do you have any idea?

Re: Shard Query runs into error when queries involve dates

swanhart

2011-09-06 06:46 pm (UTC)

Sorry I missed this. Can you please post a bug report on the bug tracker and include the SQL that you are executing?

You are viewing swanhart