Transparent encryption does not make your database secure

Transparently encrypted storage of *any* kind (storage engine based data encryption, TrueCrypt volume encryption, BitLocker, etc.) is *just as vulnerable* to most types of attack as non-encrypted data.  SQL injection, privilege escalation vulnerabilities, operating system vulnerabilities and cross site scripting attacks can all give attackers access to the database data.  It doesn't matter whether you encrypt the database's physical storage in the database itself (in the storage engine layer) or on disk (at the filesystem level), since either way the data is presented unencrypted through the SQL interface.

Transparent encryption is great for protecting the data on your laptop if the laptop itself is stolen.  It is very unlikely that someone will attack your server by physically stealing it.

It doesn't protect you from a malicious SQL injection which drops all your tables or reads all your data.

If you are worried about someone physically stealing your data, encrypting the physical storage can help with that, but it is very likely that you only need to encrypt off-site backups or other data you ship to third parties.

Also, there is no functional difference between using an encrypted cloud block device (e.g., a TrueCrypt-encrypted EBS volume) and transparently encrypting data before storing it directly in S3.

So, don't trust transparent encryption to "secure" your MySQL data.  You need a full set of security practices in place to ensure your entire stack is as safe as possible from attack.  Also, always back up your data.  Just because your data is stored in S3 doesn't mean you won't accidentally delete the wrong row, drop the wrong table or get attacked through a vulnerability.

Flexviews 1.8.0 beta 1 released

You can get Flexviews at the Flexviews project page.

This release includes the following new features:

Table change logs now include a new column: fv$gsn. This column carries the "global sequence number", which uniquely identifies the order in which changes were made in the database. This was required to fix some problems identified when updating views that do not contain aggregate functions.

Updated refresh algorithm takes GSN into account and applies changes in GSN order.

All reported bugs have been fixed.

FlexCDC now supports all MySQL data types properly.

Using the MySQL event scheduler to rotate the slow query log

If you are like me, then you like to record all of your query traffic in the slow query log with long_query_time=0. Since I'm usually using Percona Server, I usually set log_slow_verbosity='microtime,query_plan,innodb,profiling' to get as much data as possible into the slow query log for analysis with pt-query-digest.
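For reference, these settings can be applied at runtime with something like the following. This is a sketch, not a complete configuration; log_slow_verbosity and its values are Percona Server specific, as noted above.

SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0;  -- the global value applies to connections made after the change
-- Percona Server only:
SET GLOBAL log_slow_verbosity = 'microtime,query_plan,innodb,profiling';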

Generating this much data makes the slow query log harder to work with as it grows, so it is important to rotate the log regularly. This can be accomplished with a cron job, but it can also be done directly in MySQL using the event scheduler.

-- The event scheduler must be ON for events to run
SET GLOBAL event_scheduler=ON;
DROP EVENT IF EXISTS rotate_slow_log_event;
CREATE DEFINER=root@localhost EVENT rotate_slow_log_event
ON SCHEDULE
EVERY 1 HOUR
ON COMPLETION PRESERVE -- do not drop the event when it expires
ENABLE
COMMENT 'Rotate slow query log hourly'
DO
-- pointing slow_query_log_file at a new name makes the server open a fresh log file
-- (relative names are created in the datadir)
set global slow_query_log_file=CONCAT('slow-', unix_timestamp(), '.log');
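To check that the event exists and that the server has picked up a new log file name, something along these lines works as a quick sanity check (it assumes the event name used above and that the event was created in your current default database):

SELECT @@global.event_scheduler;           -- should report ON
SHOW EVENTS LIKE 'rotate_slow_log_event';  -- lists the event if it is in the current database
SELECT @@global.slow_query_log_file;       -- the file the server is currently writing to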

Advantages of weighted lists in RDBMS processing

A list is simply a collection of items. The list has no structure, except that in some cases the length of the list may be known. The list may contain duplicate items. In the following example the number 1 is included twice.

Example list:
1
2
3
1
A set is similar to a list, but has the following differences:
  1. The size of the set is always known

  2. A set may not contain duplicates

You can convert a list to a set by creating a 'weighted list'. The weighted list includes a count column so that you can determine when an item in the list appears more than once:
item, count
1, 2
2, 1
3, 1
Notice that the count column records that the value 1 appears twice in the original list. In order to make insertions into such a list scalable, consider using partitioning to avoid large indexes, as sketched below.
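For example, the weighted list could be stored in a hash partitioned table so that inserts only touch one (smaller) partition-level index at a time. This is just a sketch with hypothetical table and column names:

CREATE TABLE weighted_list (
  item BIGINT NOT NULL,
  cnt  BIGINT NOT NULL,
  KEY (item)
) ENGINE=InnoDB
PARTITION BY HASH(item) PARTITIONS 16;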

Converting from lists to weighted lists (sets) may logically compress the data. Consider the following:
item, count
1, 10000000000000
2, 10
3, 1000
4, 100000
Consider the size of the list that would be produced if this list were not weighted by count. There are very few compression methods which allow the data to be operated on without decompression. Aggregate SQL queries can use the weighted list easily without decompressing it.
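As a sketch (using the hypothetical weighted_list table from above and a hypothetical plain_list table), the weighted list can be built with a simple GROUP BY, and aggregates can be computed directly against the weights:

-- Build the weighted list from a plain list:
INSERT INTO weighted_list (item, cnt)
SELECT item, COUNT(*) FROM plain_list GROUP BY item;

-- Aggregate without expanding the weighted list back into a plain list:
SELECT SUM(cnt)        AS `COUNT(item)`,
       SUM(item * cnt) AS `SUM(item)`,
       SUM(item * cnt) / SUM(cnt) AS `AVG(item)`
FROM weighted_list;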

You will need to post-process the data to create a list from a weighted list.

Flexviews 1.7.0 GA released

I am happy to announce the GA release of Flexviews. I've numbered the release 1.7.0 GA.
There are a small number of bug fixes and enhancements:

  • Views which have aggregate functions but no GROUP BY expressions could have had problems during incremental refresh; this is fixed.
  • GA support for all aggregate functions except GROUP_CONCAT and AVG(DISTINCT).
  • The PERCENTILE_XX functions (e.g. PERCENTILE_90 for a 90th percentile calculation) are now stable too.


FlexCDC got a number of fixes and improvements:

  • TIMESTAMP columns are now properly supported
  • mysqlbinlog errors are now detected, and the program exits gracefully
  • FlexCDC now logs to a log file instead of writing to STDOUT/STDERR
  • There are new PHP scripts for adding and removing table changelogs
  • There is also a new wrapper script (consumer_safe.sh)


You can get it here: http://flexvie.ws

HandlerSocket is great, but don't compare it to Memcache just yet.

The HandlerSocket plugin for MySQL currently lacks atomic operations.  It is impossible to implement counters (increment/decrement a value) or REPLACE functionality with the current implementation.

It currently exceeds the performance of Memcache for get/set operations, but I want to see how fast it is once atomic operations are implemented. Until then, I don't think it is a serious contender for replacing Memcache as the cache layer in a complex environment.

8 substitution rules for running SQL in parallel

A lot of the work that I do on my personal projects such as Flexviews and Shard-Query involves taking a query and manipulating it in such a way as to obtain the same result, but in a different manner.  Flexviews amortizes aggregation over time, making it possible to maintain summary tables incrementally. Shard-Query (http://code.google.com/p/shard-query) spreads queries over multiple Gearman workers, allowing it to run queries on one or more MySQL servers in parallel.

Both of these tools rely on a similar set of substitution rules.  One of my favorite parts of various math classes was learning how the different mathematical expressions could be substituted for one another, allowing you to solve what otherwise appeared to be impossible problems.  Using simple substitutions, SQL can be made to solve difficult problems too, such as running queries in parallel or on multiple shards (or both).

I've tried to collect some of the rules that I use in my tools.  Most of these rules aren't very useful for regular SQL applications; they are meant to be applied when you build a parallel query tool like Shard-Query.

Rule #1 - Algebraic rules apply to SQL expressions
This rule is important for query optimization.  The MySQL optimizer isn't very smart about optimizing expressions so that they can use indexes. 
For example:
SELECT * FROM t WHERE some_col + interval 10 minute >= now()
 
Can be changed to:
SELECT * FROM t WHERE some_col >= now() - interval 10 minute;

Okay, that was a general rule of thumb, but pretty much everything else is related to parallel querying.

Rule #2 - An IN list is equivalent to a UNION ALL (as long as there are no duplicate items in the IN list)
For example:
SELECT * FROM t WHERE some_col IN (1,2,3)
Can be changed to:
SELECT * FROM t WHERE some_col = 1
UNION ALL
SELECT * FROM t WHERE some_col = 2
UNION ALL
SELECT * FROM t WHERE some_col = 3

Rule #3 - COUNT(*) = SUM(1)
For example:
SELECT COUNT(*) FROM t 
Can be changed to:
SELECT SUM(1) as `COUNT(*)` FROM t

Rule #3 is only useful in a very limited context, like Flexviews.

Rule #4 - AVG(expr) = SUM(expr)/COUNT(expr)
For example:
SELECT AVG(some_col) FROM t 
Can be changed to:
SELECT SUM(some_col)/COUNT(some_col) as `AVG(some_col)` FROM t

Of course, you knew that AVG = SUM/COUNT, but it is important to remember that this decomposition can be made to split up the work.

Rule #5 - In some cases, BETWEEN can be expressed as an IN list (for DATE and INTEGER type columns). This conversion to IN makes range lookups on a hash partitioned column possible.
For example:
SELECT * FROM t  where some_col between 1 and 3
Can be changed to:
SELECT * FROM t where some_col IN (1,2,3)
Note that once it is converted to an IN list, we might be able to further convert the IN to UNION ALL (see rules #2 and #6).

Rule #6 - It is possible to run each part of a UNION ALL in parallel if you use the READ COMMITTED transaction isolation level and the query uses no aggregation
The example given in rule #2 is perfect for this.  Each portion of the UNION ALL is independent and can be run in parallel without any further modification.
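As a sketch of the idea (not Shard-Query's actual implementation), a parallel query tool could hand each branch of the rule #2 example to its own connection and concatenate the result sets:

-- Worker connection 1:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM t WHERE some_col = 1;

-- Worker connection 2:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM t WHERE some_col = 2;

-- Worker connection 3:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM t WHERE some_col = 3;

-- The application appends the three result sets, which is equivalent to the UNION ALL.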

Rule #7 - It is possible to rewrite aggregate queries for parallel work for MIN/MAX/AVG/SUM/COUNT, but you need the following additional rewrite rules to "wrap" the UNION ALL:
COUNT(expr) := SUM(`COUNT(expr)`)
SUM(expr) := SUM(`SUM(expr)`)
MIN(expr) := MIN(`MIN(expr)`)
MAX(expr) := MAX(`MAX(expr)`)
AVG(expr) := SUM(`SUM(expr)`) / SUM(`COUNT(expr)`) # see below and rule #4
For example:
SELECT AVG(some_col) as expr FROM t WHERE col2 in (1,2,3)
 
Can be changed to:
SELECT SUM(the_union.expr_s) / SUM(the_union.expr_c) as expr
FROM (
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 1
UNION ALL
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 2
UNION ALL
SELECT SUM(some_col) as expr_s, COUNT(some_col) as expr_c FROM t WHERE col2 = 3
) as the_union

Rule #8 - It is possible to rewrite aggregate queries for parallel work for STDDEV, VARIANCE, etc., but you need the following additional rewrite rule to defer the aggregation until the final step.
STDDEV(expr) := SELECT ... ,expr, ... GROUP BY expr

For example:
SELECT STDDEV(some_col) as expr FROM t WHERE col2 in (1,2,3)
Can be changed to:
SELECT STDDEV(the_union.expr) as expr
FROM (
SELECT some_col as expr FROM t WHERE col2 = 1 GROUP BY some_col
UNION ALL
SELECT some_col as expr FROM t WHERE col2 = 2 GROUP BY some_col
UNION ALL
SELECT some_col as expr FROM t WHERE col2 = 3 GROUP BY some_col
) as the_union

Should MySQL change the terms regarding replication?

MySQL replication uses the terms 'master' and 'slave' to refer to the machine which is replicated from and the machine which is replicated to, respectively.  These terms are highly offensive to some people, given the history of slavery.  While the terms master/slave make sense for replication, given their offensive nature, perhaps MySQL should consider changing the terminology.

I suggest changing:
SHOW MASTER LOGS -> SHOW REPLICATOR LOGS;
SHOW MASTER STATUS -> SHOW REPLICATOR STATUS;

SHOW SLAVE STATUS -> SHOW REPLICATION STATUS;
START|STOP SLAVE [IO_THREAD|SQL_THREAD] -> START|STOP REPLICATION [IO_THREAD|APPLY_THREAD]

The old commands can still work for backwards compatibility, but they should be deprecated.

Flexviews 1.6.0-RC2 is released

Flexviews 1.6.0-RC2 is available immediately.

This is a bugfix release and the second Flexviews release candidate.  If no major bugs are discovered, the next release will be the GA release.

Flexviews is a stored procedure managed materialized view system for MySQL 5.1 or greater. 

What is fixed in Flexviews 1.6.0-RC2?
  • Numerous performance fixes. 
    • Flexviews uses fewer temporary tables and subqueries
    • A full table scan of the view is no longer required (only changed GROUP BY keys are scanned)
    • Dead code has been removed
  • Bug fixes
    • Removing tables and adding them to a view again could result in the WHERE clause being generated in the wrong order
    • Fix a problem with applying deltas to views which use PERCENTILE
    • Improved error messages
  • FlexCDC Bug fixes
    • Bulk insert mode did not work when a transaction changed rows in more than one table

Please note that I am not currently making separate FlexCDC releases.  If you are only interested in FlexCDC, you will find it in the consumer/ subdirectory of the Flexviews tarball.  I will package FlexCDC up as a standalone release along with the GA release of Flexviews.

Flexviews 1.6.0-RC1 is released

What's new in Flexviews 1.6.0-RC1
  • This is the first release candidate before the final release.  If no major bugs are uncovered, then the next release will be the first GA release. 
  • Flexviews now has a test suite for all major features.  The creation of these tests uncovered a number of issues which have been resolved in this release. 
  • All MySQL aggregate functions except GROUP_CONCAT are now supported. 
  • A special aggregate function called PERCENTILE is now also supported.  The calculation uses a modified version of the GROUP_CONCAT based solution suggested by Roland Bouman for percentiles.  This function should be considered experimental.  Please report bugs if you find any.
  • You can add indexes to enabled materialized views using SQL_API/add_expr
  • Adding PRIMARY KEY indexes is no longer supported.  All views get an auto-incrementing primary key.  You can add additional UNIQUE indexes instead.
  • There is an upgrade process from 1.5.3b (see UPGRADE and upgrade.sql)
Significant bug fixes
  • Views with aggregate functions but no GROUP BY columns now work properly for all supported aggregate function types
  • NULL values in GROUP BY columns are now properly supported
  • NULL values now work properly with distributive aggregate functions
  • There is a wrapper script around run_consumer.php which can restart the consumer if it stops running

As always, get it at:
http://sourceforge.net/projects/Flexviews