Home

July 8th, 2008


I've made a few fixes to the SVN version of Flexviews:
flexviews.enable() would return an error when trying to create an incremental refresh materialized view without an AVG() aggregate expression in the select clause.

Performance of DML is now improved substantially. Flexviews is designed to work in an environment with serialized DML (such as a MySQL slave). I removed extra serialization enforcement overhead. Previously two tables were involved in coordinating the 'uow_id' generation to ensure serialization in the logs. Now only one table is utilized. I need to add a procedure to auto-cleanup this new table.


There and there are a lot more to make. On my list of things to do:
1) code cleanup
flexviews went through several iterations and there are still 'mview' references all over the
place

some of the code is encased in comments because it was extracted by mysqldump

there is some extra code here and there that isn't used anymore. There might still be
some 'special refresh' spaghetti out there too.

it needs tests too. mysqltest should suffice

2) performance
the flexviews triggers currently add quite a bit of overhead to DML statements. I'd like to
add a configuration option to mysqld to turn off 'optimizing' before images in row based logs
so that I can capture changes from a binlog instead of with triggers

3) usability
currently, flexviews is very much garbage-in, garbage-out. The stored procedures need to do
some better sanity checking of input, and the procedures need to run in a stricter SQL_MODE

I really would like to somehow integrate the MySQL parser (maybe with a UDF) into flexviews
to remove the stored procedure API around creating views. I'd like to be able to parse a
SELECT statement into a view. This is mostly a pipedream, but I'd love to figure out a
way to make it work.

There is no real documentation outside of the example, and that doesn't include a
COMPLETE refresh view. I need better documentation.

I am going to follow up this email with a performance comparison between COMPLETE and INCREMENTAL refresh views, from both DML and refresh time standpoints.

Tags:


I keep talking about Flexviews, but I figured I'd step back and explain exactly what it does and how it can help you maintain aggregate tables cheaply. Read that again. It really is important. If you have a BI or DW running MySQL then you probably spend a lot of your time creating aggregate tables with cron jobs or some other ETL tool.

These aggregate generating queries might take many hours to run, and running multiple aggregations concurrently likely severely impacts performance. The more data you have the longer the aggregations take to run. If you are attempting to answer real-time business questions this presents a big challenge. Flexviews to the rescue. It gets rid of those cron jobs and custom aggregation scripts and replaces them with an API for creating materialized views (a fancy name for those aggregate tables). It supports two kinds of views: COMPLETE and INCREMENTAL refresh.

A COMPLETE materialized view behaves like those cron jobs. It builds a new version of the table, then swaps the new version for the old and drops the old one. This type of view has a few advantages: It supports all queries including those which use non-deterministic functions. It also supports OUTER JOIN. Essentially any query that you can run can be run as a COMPLETE refresh materialized view. If you use only COMPLETE refresh materialized views, then Flexviews will add no overhead to DML operations.

INCREMENTAL refresh is more restrictive in the types of queries you can construct. First of all, there is an API for constructing the queries from SQL fragments. You can't simply type in a SELECT statement and incrementally materialize it. Don't get me wrong, I want that to work some day, but for now it isn't possible. So thats adds some complexity overhead. Once you get used to the four or five stored procedures necessary for generating the query it is pretty easy to create views. Secondly, it doesn't support OUTER JOIN or non-deterministic functions like NOW(). It also doesn't support subqueries or unions. I originally intended to include support for them in the API, but it quickly became too complex. The only aggregate functions supported are COUNT(), SUM(), AVG(). INCREMENTAL refresh materialized views require materialized view logs (a table that records changes to the tables with triggers) be created on each of the tables involved in the view. This adds substantial overhead to DML on the tables, but the extra work is translated into incredible time savings on incremental rebuilds. The overhead is essentially divided between all the incremental refresh views so the benefits almost always outweigh the cost.

INCREMENTAL refresh materialized views can be extremely useful. You can use them to aggregate single tables or tables with joins, and as you will see in a few moments, the amount of time taken to refresh the view is proportional to the amount of changes in the base tables. This is far from true for a COMPLETE refresh view, which takes longer and longer as your data volume grows.

I created four sample tables as a simple model of a file sharing web site:

Groups - 50K rows, A file is posted into one group
Users - 6M rows, A user uploads and downloads files
Files - 9M rows, Files that have been uploaded
Download_log - 0 rows, This will be populated with batches of 100K random rows for our test

I created a materialized view to answer the question: How much has each user downloaded of each type of each file from each group? This can be used as the basis for TOP N type queries for displaying statistics on the website, list of most popular groups, etc. Of course, specialized materialized views for those queries can be maintained too.

I set up a test script. It runs a loop with 30 iterations. In each iteration 100K random rows are added to the download_log table. All random id values fall within range of 1-6M for users and 1-9M for files. After inserting the rows it refreshes the view. I ran the test script twice, one with a COMPLETE refresh materialized view, and the other with an INCREMENTAL refresh view. When running the COMPLETE refresh view I dropped the materialized view log, so there is no INSERT overhead. At the end of all 30 iterations there are 3M-3.5M rows in the materialized view.


select g.group_id,
       g.group_name,
       u.user_id,
       u.nickname,
       count(*) total_count,
       sum(f.filesize) total_size,
       f.filetype,
       d.download_date
  from foo.download_log d
  join foo.users u 
    on u.user_id = d.user_id 
   and u.status = 1
  join foo.files f 
    on f.file_id = d.file_id
  join foo.groups g 
    on f.group_id = g.group_id
 group by g.group_id,
          g.group_name,
          u.user_id,
          u.nickname,
          f.filetype,
          d.download_date

As you can see, there is a lot of overhead on INSERT statements (the LOAD values) for the incremental refresh, but it refreshes so much faster that the overhead is offset. Each refresh of the COMPLETE materialized view takes longer than the previous as the amount of data increases. The INCREMENTAL refresh view increases INSERT time but always refreshes in time proportional to the number of rows changed. In this case, the refresh always runs under 10 seconds regardless of table volume, where the COMPLETE refresh takes 300s at 3M rows.

So that is it for the Flexviews wrap-up. I hope you can find ways to use it. I will be releasing an updated version in the next few days.

Tags:

Profile

[info]swanhart
Justin Swanhart
Kickfire

Advertisement

Spottt Link Exchange

Spottt
Spottt

Latest Month

November 2009
S M T W T F S
1234567
891011121314
15161718192021
22232425262728
2930     
Powered by LiveJournal.com