My SQL Dump

MySQL musings by a self-professed MySQL geek

Returning to the original licenses of my tools
I was mad at Percona. They f*cked me over big time. But they can still use my tools; I'm not an asshole, at least not permanently. Such a license change is not actually workable anyway (the older versions with the older license can always be used), and it isn't in the spirit of open source.

My apologies.

New license for Swanhart-Tools and my other MySQL projects
The following new license is now in effect:
Clause #1:
The following companies:
Percona staffing llc, and any associated parent, child or ancillary entities

May not download, retain, install, operate, or in any way associate themselves with this software. This is without exception. Any use must immediately and permanently cease. Violation of this license will be pursued in a court of law.

The remainder of the license is the standard 3-clause BSD license. The above preamble overrides any relevant clauses in that license with respect to rights to use the software.

Long live MariaDB!

Advanced JSON for MySQL: indexing and aggregation for highly complex JSON documents

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing documents in a document store, where each document may have a different set of attributes (schema). Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to interoperate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with in MySQL. Until recently, JSON was just a TEXT document. I said until recently, so what has changed? The biggest thing is the new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 labs releases. Currently the JSON UDFs are at version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don't solve the really tough JSON problems we face.

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document or search a document for a particular key. That said, you can't use JSON_EXTRACT() or JSON_SEARCH() in the WHERE clause, because doing so initiates a dreaded full table scan (what MongoDB would call a full collection scan). This is a big problem, and the common wisdom is that JSON can't be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

However, I've come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is. When you think index, you think B-TREE or HASH indexes (or bitmap indexes), but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index in which words (tokens) point to documents. While fulltext indexes are great, they aren't normally usable for JSON, because MySQL splits words on whitespace and non-alphanumeric characters: the name of a field (the key) can't be associated with its value. But what if we transform the JSON? You can "flatten" the JSON down into key/value pairs and use a fulltext index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ RapidJSON library. The UDF flattens JSON documents into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=Devil's Food
type=Powdered Sugar
type=Chocolate with Sprinkles
1 row in set (0.00 sec)
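The transformation itself is simple. Here is a minimal Python sketch of the idea, a hypothetical re-implementation for illustration only, not the actual RapidJSON-based C++ UDF, and the output format is only approximate:

```python
import json

def flatten_json(doc):
    """Flatten a JSON document into key=value tokens, one per line,
    so a FULLTEXT index can associate each key with its value."""
    pairs = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    walk(value)   # descend into sub-documents and arrays
                else:
                    pairs.append("%s=%s" % (key, value))
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(doc))
    return "\n".join(pairs)

print(flatten_json('{"id": "0001", "batter": [{"id": "1001", "type": "Regular"}]}'))
# id=0001
# id=1001
# type=Regular
```

The key point is that each scalar value is emitted together with its key as a single `key=value` token, which is exactly what the fulltext tokenizer needs to keep them associated.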

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another column (or another table) and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don't want this, as it destroys the association between the key and the value. To fix this you'll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added the equal sign (=) to the list of lowercase characters, as described in the manual. You only have to change two text files; there is no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.

By the way, I used a UDF, because MySQL FULLTEXT indexes don't support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:
create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:
alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. You can create a trigger to keep the flat column in sync (I leave that as an exercise for the reader), or you can use Flexviews to maintain a copy in a second table automatically.
mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0

Using the index:
mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
| count(*) |
|        3 |
1 row in set (0.00 sec)

The documents I searched in that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.

As mentioned a moment ago, JSON documents may contain sub-documents. JSON_EXTRACT can extract a portion of a document, but the result is still a text document. There is no function that can extract ALL values for a particular key (like invoice_price) and aggregate the results. So if you have a collection of order documents, each containing a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDFs to aggregate a "total sales" figure across all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, 'key'). This UDF extracts every value stored under the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item, and return the items separated by newlines. I also created a few stored routines called jsum, jmin, jmax, jcount, and javg, which can process the output of RAPID_EXTRACT_ALL and aggregate it. If you only want to RAPID_EXTRACT_ALL from a portion of a document, extract that portion with the JSON_EXTRACT UDF first, then process it with RAPID_EXTRACT_ALL.

For example:
mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G    
*************************** 1. row ***************************
ids: 888

jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:
mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
| sum( jsum(json_extract_all(doc,'id')) ) |
|                         296615411.00000 |
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.
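To make the extract-and-aggregate idea concrete, here is a rough Python equivalent of RAPID_EXTRACT_ALL plus a jsum-style routine. This is my approximation of the UDF's behavior, not its actual source:

```python
import json

def extract_all(doc, key):
    """Collect every scalar value stored under `key`, at any depth,
    and return them newline-separated, as the UDF does."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key and not isinstance(v, (dict, list)):
                    found.append(v)
                if isinstance(v, (dict, list)):
                    walk(v)   # keep descending into sub-documents
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(doc))
    return "\n".join(str(v) for v in found)

def jsum(extracted):
    """Aggregate the newline-separated values, like the jsum stored routine."""
    return sum(float(line) for line in extracted.splitlines() if line)

doc = '{"order": {"items": [{"price": 1.5}, {"price": 2.5}]}}'
print(jsum(extract_all(doc, "price")))  # 4.0
```

Combining extraction and aggregation this way is what makes a "total sales" style query over a collection of order documents possible.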

Where to get the tools:
You can find the UDFs in the swanhart-tools GitHub repo. I think you will find these tools very useful in working with JSON documents in MySQL.

Take the long view on the MySQL PERFORMANCE_SCHEMA with ps_history and sys_history
The performance_schema is a powerful tool for analyzing MySQL performance and behavior. One limitation of the performance_schema is that its view of the data is "right now"; very little historical information is present. You can see that there are 10 threads connected right now, but what about five minutes ago?

ps_history is a set of stored routines and events for periodically collecting the data in the performance_schema into another schema called ps_history. The ps_history schema contains a copy of each performance_schema view as a real table, and timestamp and server_id columns have been added to each table. Periodically (by default every 30 seconds) the performance_schema data is written into the history tables.

ps_history comes as one script (setup.sql) which creates the ps_history schema, the tables within it, and an event for collecting data. To collect data into the history tables you must either enable the event scheduler (set global event_scheduler=1) or periodically run the ps_history.collect() stored routine.

ps_history automatically removes old data from the history tables. By default 1 week of history will be retained. You can change the retention period with ps_history.set_retention_period(), for example to keep 2 weeks of history use ps_history.set_retention_period('2 WEEK'); to keep 10 days, ps_history.set_retention_period('10 DAY');
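Conceptually, each collection cycle runs something like the following pair of statements for every performance_schema table. This is a Python sketch that just builds the SQL; the real work is done by the ps_history stored routines, and the exact column layout here is an assumption:

```python
RETENTION = "1 WEEK"  # the default retention period

def collect_sql(table):
    """Snapshot one performance_schema table into its ps_history copy,
    stamping each row with the server_id and a timestamp."""
    return ("INSERT INTO ps_history.{0} "
            "SELECT t.*, @@server_id, NOW(6) "
            "FROM performance_schema.{0} t".format(table))

def purge_sql(table, retention=RETENTION):
    """Remove history rows older than the retention period."""
    return ("DELETE FROM ps_history.{0} "
            "WHERE ts < NOW(6) - INTERVAL {1}".format(table, retention))

print(collect_sql("hosts"))
print(purge_sql("hosts", "2 WEEK"))
```

Running the collect statement every 30 seconds and the purge statement periodically gives exactly the snapshot-plus-retention behavior described above.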

Here is an example from the hosts table:
mysql> select * from ps_history.hosts limit 4\G
*************************** 1. row ***************************
               HOST: localhost
          server_id: 33
                 ts: 2015-02-09 13:17:16.68339
*************************** 2. row ***************************
               HOST: NULL
          server_id: 33
                 ts: 2015-02-09 13:17:16.68339
*************************** 3. row ***************************
               HOST: localhost
          server_id: 33
                 ts: 2015-02-09 13:17:47.69614
*************************** 4. row ***************************
               HOST: NULL
          server_id: 33
                 ts: 2015-02-09 13:17:47.69614
4 rows in set (0.00 sec)

sys_history is a modified version of the SYS schema that uses the data from the ps_history tables instead of the performance_schema. Each of the views in the SYS schema has been modified to include the timestamp (ts) column from the ps_history tables, and joins have been modified to include ts.

Here is an example of one of the aggregated views:
mysql> select * from io_by_thread_by_latency limit 4\G
*************************** 1. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: page_cleaner_thread
         total: 4809257
 total_latency: 00:08:13.93
   min_latency: 66.98 ns
   avg_latency: 1.02 ms
   max_latency: 99.29 ms
     thread_id: 18
processlist_id: NULL
*************************** 2. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 1081585
 total_latency: 00:04:44.69
   min_latency: 432.51 ns
   avg_latency: 263.22 us
   max_latency: 36.52 ms
     thread_id: 8
processlist_id: NULL
*************************** 3. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 412302
 total_latency: 00:01:34.81
   min_latency: 604.82 ns
   avg_latency: 229.94 us
   max_latency: 69.59 ms
     thread_id: 11
processlist_id: NULL
*************************** 4. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 368534
 total_latency: 00:01:32.18
   min_latency: 589.63 ns
   avg_latency: 250.12 us
   max_latency: 37.70 ms
     thread_id: 10
processlist_id: NULL
4 rows in set (5.91 sec)

Announcing a new MySQL fork: TroySQL
First, I believe that Oracle MySQL, Percona MySQL and SkySQL MySQL (MariaDB) are all great databases with their own merits. You should probably try all of them and see which one works best for you. Personally I like Percona Server, but all of them are good.

So why do we need ANOTHER fork when there are already three perfectly acceptable commercial forks?
a) Experimentation
b) Community Involvement
c) Serious feature/functionality divergence/disparity
d) Extensibility
e) Storage engines
f) OLAP focus

Experimentation:
There are a lot of interesting and useful things in the database world that are not possible in MySQL. Many of these features have been implemented as middleware or third-party add-ons, but they don't get wide adoption because they are not "built in" to the database. Two good examples are Flexviews and Shard-Query. Other experiments and worklogs have had significant work checked into branches and forks, but that work has never been accepted upstream. Good examples are "perl stored procedures" (or "external stored procedures"), and allowing plugins to add variables to the THD without modifying the source. These are good contributions that are in danger of being lost. They should not languish in a tree somewhere; I think there should be an active fork to test these ideas.

Community Involvement:
If the fork encourages experimentation, then community involvement is more likely. If the fork is more accepting of code contributions (such as those under alternate licenses like 2-clause and 3-clause BSD), this too encourages community involvement. Both are true of TroySQL. Experimentation is encouraged, and you can contribute code under any license deemed compatible with GPLv2 (LGPLv2, BSD2, BSD3). This does not mean that every feature submitted for inclusion will make it into a release, or that it will work exactly as submitted, but if you want a cool new feature like user-defined types, or you want to really make foreign keys work for all engines, then this would be a great place to try it out.

Serious feature/functionality divergence/disparity:
While TroySQL does not plan to diverge so far from MySQL as to be unrecognizable, there is a high price to be paid for guaranteeing compatibility with regular MySQL. That price may be acceptable in commercial forks, but I'd rather say "not guaranteed compatible" and be done with it. So TroySQL will maintain compatibility with regular MySQL where possible, but when necessary it will diverge. Given the desire for experimentation, divergence is necessary and should not be considered a negative quality.

Extensibility:
Instead of plugins with a C interface, I want to start moving to TRIGGERS: binlog event triggers, user logon/logoff triggers, statement triggers, commit triggers, etc. This makes writing extensions possible in any language, since a trigger can be written in any language the database supports. Many current plugins could be rewritten to use the trigger interface instead. There will still be a use for C plugins, particularly for storage engines.

Storage engines:
A column store is needed, ideally a compressing one. Index plugins are needed too, so that bitmap and text indexes can be implemented in an engine-agnostic way. Proper triggers for row changes that can invoke C code will of course be useful as well, and can lead to proper FK support for all engines.

I would like to take the CONNECT storage engine and implement proper SQL/MED foreign-data wrapper syntax on top of it, instead of using it as just another storage engine.

OLAP focus:
None of the current forks plan on adding intra-query parallelism any time soon. This feature is necessary for good performance when large volumes of data must be examined by a query. TroySQL will support query parallelism, starting with SELECT queries. Users have been asking for window functions and common table expressions; these will be added too, as well as table functions.

Materialized views, bitmap indexes, a column store, star join optimization and query rewrite are needed too. An optimizer that can use more than one join type and that has a hash join that can spill to disk would be nice too, as would sort/merge joins.

Release plan:
TroySQL will be released as "preview releases". Each preview release will have one or two new features; some features might have a lot of depth, others might be simple. Each preview release will be named, and there may be multiple preview releases for a feature. So, for example, the goal of the first preview release is query rewrite, parallel query, PHP stored procedures and Perl stored procedures. It will be called "TroySQL 5.6.X Green Swan PR1". Then, as development progresses, "Green Swan PR2" will be released. Once everything is well tested and feature complete, it will become "TroySQL 5.6.X Green Swan GA1". Finally, in a GA release, point-release improvements (no new features) are labeled "TroySQL 5.6.X Green Swan GA1.1", "GA1.2", etc.

This will identify a feature set with a name. The next release will include everything in Green Swan and add new features. Thus, it is easy to identify when a test release is being used, and because the major MySQL release number is always in the name, it is clear which name is newer than the others.

Why Green Swan? Well, Black Swan would of course be apropos, but my favorite color is Green, and my name is Swanhart. And a Black Swan is something unexpected, thus Green Swan is even more unexpected. Or something like that :)

Where to find it:

Working with comma separated list MySQL options
Over time, some options have crept into the MySQL server that are comma-separated lists of settings. These include SQL_MODE, optimizer_switch, optimizer_trace and a few other variables.

Optimizer_switch is particularly problematic to work with as it contains many options and is hard to interpret. This is what you usually see when you examine optimizer_switch:


As you can see, seeing which option is on or off is rather difficult. You can use the REPLACE function to make this easier:
mysql> select replace(@@optimizer_switch, ',','\n')\G
*************************** 1. row ***************************
replace(@@optimizer_switch, ',','\n'): index_merge=on
1 row in set (0.00 sec)

This way you can use the \P (pager) command in the mysql monitor to grep for key/value pairs:
mysql> \P grep mrr
PAGER set to 'grep mrr'
mysql> select replace(@@optimizer_switch, ',','\n')\G
1 row in set (0.00 sec)
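The same split-on-comma trick works outside the mysql client too; for example, in Python (with a shortened sample value, since the real optimizer_switch string is much longer):

```python
def parse_switches(value):
    """Turn 'name=on,name=off,...' into a dict for easy lookup."""
    return dict(item.split("=", 1) for item in value.split(","))

sample = "index_merge=on,mrr=on,mrr_cost_based=on,batched_key_access=off"
switches = parse_switches(sample)

print(switches["mrr"])                                       # on
print(sorted(k for k, v in switches.items() if v == "off"))  # ['batched_key_access']
```

Once the value is a dict, checking or diffing optimizer settings across servers becomes trivial.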

A new big data structure for streaming counters - bit length encoding
One of the challenges of big data is that it is, well, big. Computers are optimized for math on 64 bits or less; any bigger, and extra steps have to be taken to work with the data, which is very expensive. This is why BIGINT is 64 bits. In MySQL, DECIMAL can store more than 64 bits of data using fixed precision. Very large numbers can also use FLOAT, but that type is lossy.

DECIMAL is an expensive encoding: fixed-precision math is slow, and you eventually run out of precision, at which point you can't store any more data, right?

What happens when you want to store a counter that is bigger than the maximum DECIMAL?  FLOAT is lossy.  What if you need an /exact/ count of a very big number without using very much space?

I've developed an encoding method that allows you to store very large counters in a very small amount of space. It takes advantage of the fact that counters always increase, combined with choosing an arbitrary bit size in which to do the math. If this sounds confusing, the example below should make it simple to understand.

Decide on a large counter to track (the total number of bits ever downloaded from Wikipedia, for example). There are two goals that have to be met if the counter is to be manageable:

  1. The data must be stored in a compact way, so that it doesn't take a long time to read or write the counter value, and in a representation the machine can still do further computation on.

  2. The data must be stored in such a way that a counter with a million digits does not have to print one million digits on the screen, but can instead use an alternate notation. You see this with floating point all the time, which removes decimal places by using powers of ten; the same idea can be extended to any counter using powers-of-two notation, without ever losing precision.

Storing the data in the database:

  1. Define a counter.  We'll call it bits_on_the_internet.

  2. In the database use two columns: an INT column called COUNTER_DATA and a DECIMAL(60,0) column called COUNTER_DATA_OVERFLOWS.

  3. Use 64-bit math to add the size (in bits) of each document to the counter, and store the result in memory as NEW_COUNTER_DATA.

  4. The new value, stored in 64 bits, may have overflowed 32 bits.

  5. If so, then while NEW_COUNTER_DATA >= 2^32:

    1. Increment COUNTER_DATA_OVERFLOWS by one.

    2. Subtract 2^32 from NEW_COUNTER_DATA.

  6. Write what remains of NEW_COUNTER_DATA to COUNTER_DATA.

  7. If there was no overflow you don't need to do anything special. Just store the new COUNTER_DATA.

Decoding the stored data:

  1. Print the following on the screen: COUNTER_DATA + ((2^32) * COUNTER_DATA_OVERFLOWS)

  2. The computer can still do math on the encoded format; I just showed you how to add to the encoded format above.

  3. Using this method VERY large counters can be stored efficiently.

  4. If one overflow column is not enough, add another one, COUNTER_DATA_OVERFLOWS2, which counts how many times COUNTER_DATA_OVERFLOWS itself overflowed. In other words, overflow COUNTER_DATA_OVERFLOWS into COUNTER_DATA_OVERFLOWS2.

I call this "bit-length encoding".

I also have a working non-database example written in PHP here:
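For readers who don't want to dig up the PHP, here is the same scheme sketched in Python, with plain integers standing in for the INT and DECIMAL columns:

```python
BASE = 2 ** 32  # the arbitrary bit size chosen for the math

class BigCounter:
    """Bit-length encoded counter: a 32-bit 'data' word plus a count of
    how many times 2^32 has been subtracted (the overflow column)."""

    def __init__(self):
        self.counter_data = 0   # maps to the INT column
        self.overflows = 0      # maps to the DECIMAL(60,0) column

    def add(self, amount):
        new_value = self.counter_data + amount  # the 64-bit math step
        while new_value >= BASE:                # did we overflow 32 bits?
            self.overflows += 1                 # record one subtraction...
            new_value -= BASE                   # ...of 2^32
        self.counter_data = new_value           # store the remainder

    def value(self):
        # decode: COUNTER_DATA + 2^32 * COUNTER_DATA_OVERFLOWS
        return self.counter_data + BASE * self.overflows

c = BigCounter()
c.add(2 ** 33 + 5)  # big enough to overflow twice
print(c.overflows, c.counter_data, c.value())  # 2 5 8589934597
```

Note that the stored pair (counter_data, overflows) never grows beyond two small numbers, no matter how large the decoded value becomes.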

Why it works:
Algebra says that if you take something away on one side of an equation, you have to add it back on the other.  This method just keeps track of how many times a constant has been taken away.  To decode the number you simply add the constant back that many times (i.e., multiply the constant by the overflow count).  The method implemented is called "bag algebra".

The most important thing about this counter is that it easily shows human-readable numbers without really converting from the machine representation.  That is, you can work with this "compressed" (encoded, really) data directly, even adding to or subtracting from it, as long as you use the rules above.   You can do this without having to decompress it into as many bits as would be needed for two's-complement notation, so very large numbers don't take a long time to print or a lot of space when they are transmitted over the network for aggregation.

This method also makes it easy to store the counter on more than one machine.  Just add the counters from each machine using 64-bit math, add the overflow counters, and then print the total representation; counter calculation is thus embarrassingly parallel.

The hex datatype?
Anders posted a blog post about adding a new JSON data type to MariaDB. This made me remember my very first MySQL User Conference, well, that and the next couple of conferences actually. Why? Because at every conference (for a while) I remember Brian Aker getting everybody into a room to discuss how to add a data type to MySQL. It was considered pretty much the quintessential example of how to start developing for MySQL, because it touches a lot of the basic guts of the server.

The example that Brian always gave was a "hex" data type, so that you could store MD5 values using 128 bits of storage instead of 256. A human-readable MD5 is a 32-character hex string (256 bits as text) representing 128 bits of actual data. So storing MD5 values in the database (a very common thing) requires twice as much space as it should.

Now, the question I've always asked is, "Why did nobody actually implement the type?". It seems really useful. And it was always said to be trivial. So why doesn't it exist?

Perhaps because you can use a BINARY(16) column and store the raw value, converting between hex and binary? Or maybe just use a UDF for storage and retrieval?
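The BINARY(16) workaround is easy to demonstrate (a Python sketch; in MySQL the equivalent conversion is UNHEX()/HEX() around a BINARY(16) column):

```python
import hashlib

digest = hashlib.md5(b"hello world")

hex_form = digest.hexdigest()  # 32-character string: what people usually store
raw_form = digest.digest()     # 16 raw bytes: the actual 128 bits

print(len(hex_form), len(raw_form))          # 32 16
assert bytes.fromhex(hex_form) == raw_form   # the round trip is lossless
```

Half the storage, with a trivial conversion at read and write time, which is exactly the saving a native hex type would give you automatically.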

Should a hex data type get added after all this time? How about a UUID data type? Especially with UUIDs being so common, it could save a lot of space.

Archive vs Percona XtraDB vs Tokutek TokuDB LOAD DATA performance
Stewart blogged about the ARCHIVE storage engine and asked the Internet to test how fast ARCHIVE is compared with other MySQL storage engines, such as Percona XtraDB and Tokutek TokuDB. Since I've been curious about this subject as well, I decided to give it a quick test.

Not very compressible data

I took a star schema benchmark "lineorder" file and grabbed the first 15M rows. To each row I added a TEXT field called "extra_data". This field contains 10 random words from /usr/share/dict/words separated by space. This adds up to just about 3GB of raw input data.

Insert performance with no indexes (best to worst)

TokuDB: 187K rows/sec
Query OK, 15000000 rows affected (1 min 20.25 sec)

XtraDB (uncompressed): 119K rows/sec:
Query OK, 15000000 rows affected (2 min 5.40 sec)

Archive: 61K rows/sec:
Query OK, 15000000 rows affected (4 min 3.15 sec)

XtraDB Compressed key_block_size=8: 6K rows/sec:
I cancelled after 6 minutes; at that point 2.2M of the 15M rows had been inserted.
---TRANSACTION 514, ACTIVE 365 sec inserting
1 lock struct(s), heap size 376, 0 row lock(s), undo log entries 2233162

Test with indexes

I added the following keys to the table:
alter table insert_test add key(LO_CustKey), add key(LO_OrderDateKey),
add key(LO_SuppKey), add key(LO_PartKey);

I then loaded the table as before to see how multiple indexes affect insert performance.

TokuDB: 84K rows/sec:
Query OK, 15000000 rows affected (2 min 58.42 sec)

InnoDB: 53K rows/sec:
Query OK, 15000000 rows affected (4 min 40.05 sec)

Insert performance with more highly compressible data

This time I took the first 5M rows, but instead of just adding 10 random words, I added the words in a repeating "phrase". For example: "word1 word2 ... word10 word1 word2 ... word10 word1 ...". This generated about the same amount of data.

TokuDB (no indexes): 92K rows/sec:
Query OK, 5000000 rows affected (45.34 sec)

TokuDB (with indexes): 66K rows/sec:
Query OK, 5000000 rows affected (1 min 14.89 sec)

XtraDB (no indexes): 65K rows/sec:
Query OK, 5000000 rows affected (1 min 16.75 sec)

Archive: 43K rows/sec:
Query OK, 5000000 rows affected (1 min 55.24 sec)

XtraDB (with indexes): 36.6K rows/sec:
Query OK, 5000000 rows affected (2 min 16.23 sec)

XtraDB Compressed key_block_size=8 (no indexes): 19.25K rows/sec:
Query OK, 5000000 rows affected (4 min 20.12 sec)

XtraDB Compressed key_block_size=8 (with indexes): 7.59K rows/sec:
Query OK, 5000000 rows affected (10 min 58.43 sec)

Building index on copy of the 15M row table (best to worst)

alter table sortbuild_test  add key(LO_CustKey), add key(LO_OrderDateKey),  add key(LO_SuppKey), add key(LO_PartKey);

[edit: Fixed order to put TokuDB first]
Query OK, 0 rows affected (1 min 40.55 sec)

Query OK, 0 rows affected (2 min 21.93 sec)

Building index on copy of the 5M row table (InnoDB 8k only)

XtraDB Compressed key_block_size=8:
Query OK, 0 rows affected (1 min 14.05 sec)

Data size comparison (smallest disk footprint to largest)

These numbers were calculated using INFORMATION_SCHEMA.TABLES which reports the UNCOMPRESSED size for TokuDB tables. I hope they fix this in a new release. See the comments for a look at the tokudb data directory contents.

| table_schema                 | table_name        | data_mb | idx_mb  | sum_size |
| archive_phrase               | insert_test       |  529.78 |    0.00 |   529.78 |
| archive                      | insert_test       | 1386.51 |    0.00 |  1386.51 |
| innodb_compressed_phrase     | insert_test_8k    | 1601.00 |    0.00 |  1601.00 |
| innodb_compressed_phrase_idx | sortbuild_test_8k | 1601.00 |  151.30 |  1752.30 |
| innodb_compressed_phrase_idx | insert_test_8k    | 1601.00 |  433.83 |  2034.83 |
| tokudb                       | insert_test       |    X    |    X    |     X    |
| tokudb_phrase                | insert_test       |    X    |    X    |     X    |
| tokudb_phrase_idx            | insert_test       |    X    |    X    |     X    |
| innodb                       | insert_test       | 3112.00 |    0.00 |  3112.00 |
| innodb_phrase                | insert_test       | 3202.00 |    0.00 |  3202.00 |
| tokudb_idx                   | insert_test       |    X    |    X    |     X    |
| tokudb_idx                   | sortbuild_test    |    X    |    X    |     X    |
| innodb_phrase_idx            | insert_test       | 3202.00 |  565.03 |  3767.03 |
| innodb_idx                   | sortbuild_test    | 3112.00 |  899.19 |  4011.19 |
| innodb_idx                   | insert_test       | 3112.00 | 1519.00 |  4631.00 |

The TokuDB test table is 1800MB (15M row table, 3G orig size)
The TokuDB test2 table is 2171MB (keys built with ALTER)
The tokudb_idx database is 2202MB (keys built with INSERT)
The tokudb_phrase database is 681MB (5M row, 3G orig size)
The tokudb_phrase_idx database is 806MB (5M row, keys built with INSERT)

Here is the CREATE TABLE for the data:
create table insert_test (
LO_OrderKey bigint not null,
LO_LineNumber tinyint not null,
LO_CustKey int not null,
LO_PartKey int not null,
LO_SuppKey int not null,
LO_OrderDateKey int not null,
LO_OrderPriority varchar(15),
LO_ShipPriority char(1),
LO_Quantity tinyint,
LO_ExtendedPrice decimal,
LO_OrdTotalPrice decimal,
LO_Discount decimal,
LO_Revenue decimal,
LO_SupplyCost decimal,
LO_Tax tinyint,
LO_CommitDateKey int not null,
LO_ShipMode varchar(10),
extra_data TEXT
);

For compression:

Here is the my.cnf for Percona XtraDB (no TokuDB tuning was done):
# Disabling symbolic-links is recommended to prevent assorted security risks




Test Hardware:
Intel i7-970 3.2GHz 6 core w/HT (12 threads)
LSI 9211-8i IT HBA
DB on software RAID 0 - two Intel 520 SSDs, one OCZ Vertex.  300GB total.

Some LSI 9211-8i issues on Windows and Linux
Make sure you flash an LSI-9211 to IT firmware rev#14 to get it to work 
with Linux and SSD trim.  You may have to downgrade from newer firmware
to older firmware to get the card to work.


Finding a SATA III controller with more than one PCI-e lane

After a recent hardware issue I decided to upgrade my computer to use new Intel 520 120GB SSD drives in RAID for improved performance.  The motherboard I use (an ASUS Rampage III Extreme) has a Marvell SATA III controller with two ports, but I discovered that it is connected via only a single PCI-e lane (each lane can do at most 400MB/sec*).  This means it can't effectively support even a single Intel 520, because one device can saturate the SATA III bus (an Intel 520 is rated at up to 550MB/sec sequential write).

So I went on a quest for a new SATA 3 controller.   To Frys! I exclaimed.  But unfortunately, all the PCI-e 2.x SATA III controllers used a single lane!  (These cards also feature RAID which is a double wtf).  

Well, having been thwarted by my Frys run (I could have upgraded the motherboard, but I want to wait for the next generation of processors with DDR4 RAM), I went to Central Computers in Mountain View.  There I found LSI PCI-e HBAs, which use 8 PCI-e lanes.  I decided to get an LSI 9211-8i for a little less than it sells for on NewEgg.  This is a RAID controller, and I specifically told the technician that I wanted a RAID controller that supported TRIM; he said LSI supports TRIM**.  So I laid down my money and headed home.

Strange Windows 7 Professional 64 bit performance

I booted into Windows 7 and performance on AS-SSD (a simple benchmark utility targeted at SSD performance) for the drive was horrible. So horrible the test could not complete. During the test the performance was reported to be pegged at 98 IOPS. To compare, the same test, same drive, same OS, but a different card: 13000 IOPS. 13K IOPS is low for the drive, but it was an old SATA II card. Regardless, it shows the drive can do way better than the 98 IOPS the LSI card reported while attempting to complete the test. I messed around with this for quite a while, pulling out hair. I eventually tried upgrading the card to R16 firmware. Same issue. I figured there might be some sort of Windows weirdness, so I decided to test Linux. I have contacted LSI about the issue with AS-SSD to see if a driver issue is possibly to blame.

Linux and the case of the mysterious hanging mkfs.ext4

I installed CentOS 6.4 minimal on another 32GB OCZ SSD I had lying around.  CentOS found the LSI card just fine.  I could run a `dd` on the Intel 520 drives and they could do 500MB/sec+ sequential write.  Great.  Next I tried to format a filesystem on the device using `mkfs.ext4`.   ext4 now supports TRIM and will discard all of the blocks on a target when it creates a new filesystem. However, when I tried to create a filesystem with `mkfs.ext4`, it would simply hang at 0/X bytes discarded. Eventually the kernel started printing messages like "attempting task abort!" and "task abort: SUCCESS scmd" into the log.  

There was no way to kill `mkfs.ext4`.
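If you hit the same hang and can't (or don't want to) reflash the card, e2fsprogs can skip the discard pass entirely at format time. This is a general workaround, not what I did here, and the device names are just examples:

```shell
# Skip the TRIM/discard pass during filesystem creation. This avoids the
# hang, but the old blocks stay un-discarded, so the SSD's garbage
# collector gets no help from the format.
mkfs.ext4 -E nodiscard /dev/sdb1

# Once the firmware situation is sorted out, you can discard free space
# later on the mounted filesystem:
mount /dev/sdb1 /mnt/data
fstrim -v /mnt/data
```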

I did some searching and found an old post on the OCZ forum about problems with an OCZ Vertex 3, which is SandForce-based, same as the Intel 520. When I was doing my tests, my Vertex 3 was on another SATA bus, isolated from the LSI, so I think this may be SandForce related.

In case you didn't read the post: that user flashed to firmware R14 to fix the problem. Unfortunately I had already upgraded past it, to R16.

Fixing the mkfs.ext4 issue

I decided to try to flash down to R14. But of course, the flash utility won't let you downgrade by default. However, I found a workaround using a VERY DANGEROUS mode on the controller: RESET MODE. To use this function you must use the DOS flash method, so I created a USB FreeDOS boot disk. The method may also work from UEFI, but UEFI is a PITA. It doesn't work from the Linux or Windows flash utilities.

To place the card in RESET MODE:
C:\>sas2flsh.exe -o -e 6

Now you must flash the new firmware AND bios:
C:\>sas2flsh.exe -o -f FIRMWARE.bin -b BIOS.rom 

It is safe to reboot after flashing.

With the LSI card flashed down to R14 I was able to create a filesystem on the device, but TRIM was not working: /sys/block/sda/queue/discard_max_bytes was 0.

It turns out only LSI HBAs running initiator (IT) firmware support TRIM. So on this card you can't use RAID and TRIM together, and the IR firmware doesn't even give you TRIM on non-RAID (JBOD) devices. The sales guy sold me another lemon. Oh well, software RAID 0 isn't a large overhead.
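Whether the kernel will actually issue TRIM to a device can be checked from Linux before you format anything. A quick sanity check (device name is an example; `lsblk --discard` needs a reasonably recent util-linux):

```shell
# Non-zero means the kernel is willing to send discards to the device
cat /sys/block/sda/queue/discard_max_bytes

# Summarize discard support for all block devices; a DISC-MAX of 0B
# means no TRIM on that device
lsblk --discard
```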

Getting rid of raid (flash to IT mode)

Oh bugger, sas2flsh won't let you switch from IR (RAID) to IT (initiator) firmware without going into RESET MODE either. Follow the same instructions above, but after you RESET the card, flash the IT firmware instead of IR. The BIOS is the same in either case.

Final setup

I'm doing software RAID 1 over a portion of the two Intel 520s, swap on a portion of the OCZ, and software RAID 0 over the remainder of all three. This gives me a safe area for my OS and important files, a large high performance database test area, and the swap area on the OCZ will be regarded as unused by TRIM. This improves garbage collection on the OCZ, which was in use for some time before I bought the 520s.
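That layout can be sketched with mdadm. The partition names below are illustrative, not my exact layout:

```shell
# RAID 1 over matching partitions on the two Intel 520s (OS + important files)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

# RAID 0 over the remaining space on all three SSDs (database test area)
mdadm --create /dev/md1 --level=0 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2

# Swap lives on its own partition on the OCZ, outside both arrays
mkswap /dev/sdc1 && swapon /dev/sdc1
```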

* You will see it quoted as 5GT/s and/or 5Gbit/sec but that is before 8b/10b encoding is applied for the data transfer over the bus
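The arithmetic behind that footnote, as I understand it: PCIe 2.0 signals at 5 GT/s per lane, 8b/10b encoding means only 8 of every 10 bits on the wire are payload, and packet/protocol overhead eats a bit more, which is how a "5Gbit" lane ends up at roughly 400MB/sec usable:

```python
# PCIe 2.0 per-lane payload bandwidth after 8b/10b line coding.
transfers_per_sec = 5_000_000_000             # 5 GT/s per lane
payload_bits = transfers_per_sec * 8 / 10     # 8b/10b: 8 payload bits per 10 on the wire
payload_bytes = payload_bits / 8              # bits -> bytes

print(payload_bytes)  # 500 MB/s theoretical; ~400 MB/s after protocol overhead
```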

** Turns out LSI only supports TRIM for HBA initiators (IT firmware), not HBA RAID. You can flash to IT firmware though. See the post.

*** This post isn't MySQL specific.

