My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Add to Memories Share Next Entry
Flexviews - A performance overview (incremental refresh is 30x faster!)
swanhart
I keep talking about Flexviews, but I figured I'd step back and explain exactly what it does and how it can help you maintain aggregate tables cheaply. Read that again. It really is important. If you have a BI or DW running MySQL then you probably spend a lot of your time creating aggregate tables with cron jobs or some other ETL tool.

These aggregate generating queries might take many hours to run, and running multiple aggregations concurrently likely severely impacts performance. The more data you have the longer the aggregations take to run. If you are attempting to answer real-time business questions this presents a big challenge. Flexviews to the rescue. It gets rid of those cron jobs and custom aggregation scripts and replaces them with an API for creating materialized views (a fancy name for those aggregate tables). It supports two kinds of views: COMPLETE and INCREMENTAL refresh.

A COMPLETE materialized view behaves like those cron jobs. It builds a new version of the table, then swaps the new version for the old and drops the old one. This type of view has a few advantages: It supports all queries including those which use non-deterministic functions. It also supports OUTER JOIN. Essentially any query that you can run can be run as a COMPLETE refresh materialized view. If you use only COMPLETE refresh materialized views, then Flexviews will add no overhead to DML operations.

INCREMENTAL refresh is more restrictive in the types of queries you can construct. First of all, there is an API for constructing the queries from SQL fragments. You can't simply type in a SELECT statement and incrementally materialize it. Don't get me wrong, I want that to work some day, but for now it isn't possible. So thats adds some complexity overhead. Once you get used to the four or five stored procedures necessary for generating the query it is pretty easy to create views. Secondly, it doesn't support OUTER JOIN or non-deterministic functions like NOW(). It also doesn't support subqueries or unions. I originally intended to include support for them in the API, but it quickly became too complex. The only aggregate functions supported are COUNT(), SUM(), AVG(). INCREMENTAL refresh materialized views require materialized view logs (a table that records changes to the tables with triggers) be created on each of the tables involved in the view. This adds substantial overhead to DML on the tables, but the extra work is translated into incredible time savings on incremental rebuilds. The overhead is essentially divided between all the incremental refresh views so the benefits almost always outweigh the cost.

INCREMENTAL refresh materialized views can be extremely useful. You can use them to aggregate single tables or tables with joins, and as you will see in a few moments, the amount of time taken to refresh the view is proportional to the amount of changes in the base tables. This is far from true for a COMPLETE refresh view, which takes longer and longer as your data volume grows.

I created four sample tables as a simple model of a file sharing web site:

Groups - 50K rows, A file is posted into one group
Users - 6M rows, A user uploads and downloads files
Files - 9M rows, Files that have been uploaded
Download_log - 0 rows, This will be populated with batches of 100K random rows for our test

I created a materialized view to answer the question: How much has each user downloaded of each type of each file from each group? This can be used as the basis for TOP N type queries for displaying statistics on the website, list of most popular groups, etc. Of course, specialized materialized views for those queries can be maintained too.

I set up a test script. It runs a loop with 30 iterations. In each iteration 100K random rows are added to the download_log table. All random id values fall within range of 1-6M for users and 1-9M for files. After inserting the rows it refreshes the view. I ran the test script twice, one with a COMPLETE refresh materialized view, and the other with an INCREMENTAL refresh view. When running the COMPLETE refresh view I dropped the materialized view log, so there is no INSERT overhead. At the end of all 30 iterations there are 3M-3.5M rows in the materialized view.


select g.group_id,
       g.group_name,
       u.user_id,
       u.nickname,
       count(*) total_count,
       sum(f.filesize) total_size,
       f.filetype,
       d.download_date
  from foo.download_log d
  join foo.users u 
    on u.user_id = d.user_id 
   and u.status = 1
  join foo.files f 
    on f.file_id = d.file_id
  join foo.groups g 
    on f.group_id = g.group_id
 group by g.group_id,
          g.group_name,
          u.user_id,
          u.nickname,
          f.filetype,
          d.download_date

As you can see, there is a lot of overhead on INSERT statements (the LOAD values) for the incremental refresh, but it refreshes so much faster that the overhead is offset. Each refresh of the COMPLETE materialized view takes longer than the previous as the amount of data increases. The INCREMENTAL refresh view increases INSERT time but always refreshes in time proportional to the number of rows changed. In this case, the refresh always runs under 10 seconds regardless of table volume, where the COMPLETE refresh takes 300s at 3M rows.

So that is it for the Flexviews wrap-up. I hope you can find ways to use it. I will be releasing an updated version in the next few days.
Tags:

Flexviews

(Anonymous)

2008-07-09 08:11 am (UTC)

Justin,

this is an absolutely amazing thing you did here!!

It sounds like an incredibly clever solution, and very useful given the current lack of MViews in MySQL.

Kudos!

Roland Bouman
http://rpbouman.blogspot.com/

Thanks Roland.

You are viewing swanhart