GROUP_CONCAT is very slow. So I used FastBit for the table instead! 43 seconds down to .16!

MySQL always tries to make toast of my good ideas. This is the only time it succeeds in making toast (see bug #2).

This time I'm working on star schema optimization (which will be my next blog post). MySQL takes about .24 seconds to project out a list of parts that match a filter, but it takes 43 (not .43, FORTY-THREE) seconds to turn that list into a CSV string using GROUP_CONCAT.

  select P_PartKey from part where P_MFGR = 'MFGR#1' or P_MFGR='MFGR#2';
  399597 rows in set (0.24 sec)

  mysql> select group_concat(P_PartKey) into @pkey from part where P_MFGR = 'MFGR#1' or P_MFGR='MFGR#2';
  Query OK, 1 row affected (43.25 sec)

So I decided to just stick the part table into FastBit instead (using my FastBit_UDF tools):
fb_create('/var/lib/fastbit/part', 'P_PartKey:int,P_Name:text,P_MFGR:text,P_Category:text,P_Brand:text,P_Colour:text,P_Type:text,P_Size:text,P_Container:text');

-- this loads the table, which is one time only
select fb_insert2('/var/lib/fastbit/part',P_PartKey, P_Name, P_MFGR, P_Category, P_Brand, P_Colour, P_Type, P_Size, P_Container) from part group by 1;
Query OK, 1 row affected (15 seconds)

-- create the same CSV list using fb_inlist():
select fastbit.fb_inlist('/var/lib/fastbit/part',"select P_PartKey where P_MFGR='MFGR#1' or P_MFGR='MFGR#2'") into @plist;
Query OK, 1 row affected (0.16 sec)

I think .16 seconds is a lot better than 43! :D
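I haven't profiled GROUP_CONCAT, but one classic cause of this kind of slowdown is growing a result buffer one piece at a time. Here's an illustrative Python sketch (nothing to do with MySQL internals) of the two strategies for building a CSV list:

```python
def csv_concat(keys):
    # Naive approach: grow the string one piece at a time,
    # reallocating and copying the whole buffer as it grows.
    out = ""
    for k in keys:
        out = out + ("," if out else "") + str(k)
    return out

def csv_join(keys):
    # Single-pass approach: one allocation for the final string.
    return ",".join(str(k) for k in keys)

# Both produce the same CSV list; only the cost profile differs.
keys = range(1, 1001)
assert csv_concat(keys) == csv_join(keys)
```

If you do use GROUP_CONCAT in MySQL, also check group_concat_max_len: the default (1024) silently truncates long lists, which matters for a 400K-row result.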

You can find the FastBit_UDF tools here:

ASYNC: native asynchronous query mechanism for MySQL, updated for 5.6 with support for non-SELECT statements

Async, part of the Swanhart Toolkit, has been updated to fully support 5.6 (the GET_LOCK I used previously only worked with the new 5.7 GET_LOCK semantics), and it now also supports INSERT/UPDATE/DELETE/CREATE/CALL/etc. (anything that can be prepared). You can only get a resultset back from SELECT statements, even though CALL statements can return resultsets; this is a MySQL limitation (there is no CALL xyx(..) AS table_name).

Now available in swanhart-tools: NATIVE asynchronous query execution for any MySQL client!

There is often a need to run queries in the background in MySQL. This is generally accomplished using a message queue (like gearman), by using extensions to a client (PHP has such extensions), or by using MariaDB's native async query interface (C client only, I think).

While such solutions work well, they don't work for, say, a Go client, or for the mysql command line client itself.

I present "async", part of the Swanhart Toolkit. Async is a stored procedure and event based solution for asynchronous queries.

It consists of:

  • A queue table to hold the SQL to run, the state of execution, error messages, etc

  • A settings table that controls the number of parallel threads to use for executing queries

  • A stored routine worker that picks up SQL queries and executes them

  • An EVENT that runs the worker threads (which exits if too many are running)

  • SQL stored routines for dropping SQL into the queue (async.queue), checking the status of the query(async.check), and getting the result of a query (async.wait).
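The pieces above can be sketched as a toy Python analogue (hypothetical; the real implementation is SQL stored routines plus an EVENT, not Python): a queue of rows, worker threads that claim PENDING entries, and per-query state and error tracking.

```python
import threading

queue_table = []   # rows: {"q_id", "sql_text", "state", "errmsg"}
lock = threading.Lock()

def queue_query(sql_text):
    # Analogue of async.queue(): enqueue SQL, return the query number.
    with lock:
        q_id = len(queue_table) + 1
        queue_table.append({"q_id": q_id, "sql_text": sql_text,
                            "state": "PENDING", "errmsg": None})
        return q_id

def worker(execute):
    # Analogue of the worker routine: claim the next PENDING row,
    # run it, and record COMPLETED or ERROR.
    while True:
        with lock:
            row = next((r for r in queue_table if r["state"] == "PENDING"), None)
            if row is None:
                return
            row["state"] = "RUNNING"
        try:
            execute(row["sql_text"])
            row["state"] = "COMPLETED"
        except Exception as e:
            row["state"], row["errmsg"] = "ERROR", str(e)

def run_workers(execute, n=8):
    # Analogue of the EVENT spawning up to n parallel workers.
    threads = [threading.Thread(target=worker, args=(execute,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The `execute` callable stands in for "prepare and run the SQL"; in the real tool that happens inside the MySQL server itself.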

It works with MySQL 5.6 or greater (you can make it work with 5.5 if you get rid of GET DIAGNOSTICS, with the tradeoff that you won't have good error messages for your broken queries).

Just run the setup.sql script, enable the event scheduler, then:
call async.queue('select sleep(50)');  -- returns the query number (also in @query_number)
call async.check(@query_number)\G
call async.wait(@query_number);

-- eight workers run by default (update the settings table to change that)
mysql> show processlist;
| Id   | User            | Host      | db    | Command | Time | State                       | Info             |
|  143 | root            | localhost | async | Query   |    0 | User sleep                  | DO SLEEP(v_wait) |
|  146 | event_scheduler | localhost | NULL  | Daemon  |    0 | Waiting for next activation | NULL             |
|  147 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
|  148 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
|  149 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
|  150 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
|  151 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
|  152 | root            | localhost | async | Connect |    0 | User sleep                  | DO SLEEP(v_wait) |
| 1525 | root            | localhost | async | Query   |    0 | starting                    | show processlist |
9 rows in set (0.00 sec)

-- queue a background query
mysql> call async.queue('select sleep(25)');
|            1 |
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

-- check the status
mysql> call async.check(@query_number)\G
*************************** 1. row ***************************
        q_id: 1
    sql_text: select sleep(25)
  created_on: 2016-02-24 14:09:30
  started_on: 2016-02-24 14:09:30
completed_on: NULL
      parent: NULL
   completed: 0
       state: RUNNING
       errno: NULL
      errmsg: NULL
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

-- wait for a query to finish (or just get the result of a finished query)
mysql> call async.wait(@query_number);
| sleep(25) |
|         0 |
1 row in set (10.18 sec)

Query OK, 0 rows affected (10.18 sec)

-- run a query that generates an error
mysql> call async.queue('select sleep(25) from bad_table');
|            2 |
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

-- get the error information for the bad query
mysql> call async.check(@query_number)\G
*************************** 1. row ***************************
        q_id: 2
    sql_text: select sleep(25) from bad_table
  created_on: 2016-02-24 14:10:04
  started_on: 2016-02-24 14:10:04
completed_on: 2016-02-24 14:10:04
      parent: NULL
   completed: 1
       state: ERROR
       errno: 42S02
      errmsg: Table 'async.bad_table' doesn't exist
1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

Advanced JSON for MySQL: indexing and aggregation for highly complex JSON documents

What is JSON

JSON is a text-based, human-readable format for transmitting data between systems, for serializing objects, and for storing document-store data in which each document can have different attributes/schema. Popular document store databases use JSON (and the related BSON) for storing and transmitting data.

Problems with JSON in MySQL

It is difficult to inter-operate between MySQL and MongoDB (or other document databases) because JSON has traditionally been very difficult to work with. Up until recently, JSON was just a TEXT document to MySQL. I said up until recently, so what has changed? The biggest thing is that there are new JSON UDFs by Sveta Smirnova, which are part of the MySQL 5.7 Labs releases. Currently the JSON UDFs are up to version 0.0.4. While these new UDFs are a welcome addition to the MySQL database, they don't solve the really tough JSON problems we face.

The JSON UDFs provide a number of functions that make working with JSON easier, including the ability to extract portions of a document, or search a document for a particular key. That being said, you can't use JSON_EXTRACT() or JSON_SEARCH() in the WHERE clause, because doing so initiates a dreaded full table scan (what MongoDB would call a full collection scan). This is a big problem, and the common wisdom is that JSON can't be indexed for efficient WHERE clauses, especially sub-documents like arrays or objects within the JSON.

Actually, however, I've come up with a technique to effectively index JSON data in MySQL (to any depth). The key lies in transforming the JSON from a format that is not easily indexed into one that is easily indexed. Now, when you think index you think B-TREE or HASH indexes (or bitmap indexes) but MySQL also supports FULLTEXT indexes.

A fulltext index is an inverted index where words (tokens) point to documents. While text indexes are great, they aren't normally usable for JSON. The reason is, MySQL splits words on whitespace and non-alphanumeric characters. A JSON document doesn't end up being usable when the name of the field (the key) can't be associated with the value. But what if we transform the JSON? You can "flatten" the JSON down into key/value pairs and use a text index to associate the key/value pairs with the document. I created a UDF called RAPID_FLATTEN_JSON using the C++ Rapid JSON library. The UDF flattens JSON documents down into key/value pairs for the specific purpose of indexing.

Here is an example JSON document:
	"id": "0001",
	"type": "donut",
	"name": "Cake",
	"ppu": 0.55,
					{ "id": "1001", "type": "Regular" },
					{ "id": "1002", "type": "Chocolate" },
					{ "id": "1003", "type": "Blueberry" },
					{ "id": "1004", "type": "Devil's Food" }
			{ "id": "5001", "type": "None" },
			{ "id": "5002", "type": "Glazed" },
			{ "id": "5005", "type": "Sugar" },
			{ "id": "5007", "type": "Powdered Sugar" },
			{ "id": "5006", "type": "Chocolate with Sprinkles" },
			{ "id": "5003", "type": "Chocolate" },
			{ "id": "5004", "type": "Maple" }

mysql> select RAPID_FLATTEN_JSON(load_file('/tmp/doc.json'))\G
*************************** 1. row ***************************
RAPID_FLATTEN_JSON(load_file('/tmp/doc.json')): id=0001
type=Devil's Food
type=Powdered Sugar
type=Chocolate with Sprinkles
1 row in set (0.00 sec)
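The flattening transform itself can be sketched in Python (the real UDF is C++ on the RapidJSON library; this just shows the idea of walking the document and emitting key=value pairs):

```python
import json

def flatten_json(doc):
    """Flatten a JSON document into newline-separated key=value pairs."""
    pairs = []
    def walk(node, key=None):
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, k)          # keys become the pair names
        elif isinstance(node, list):
            for item in node:
                walk(item, key)     # array items inherit the enclosing key
        else:
            pairs.append("%s=%s" % (key, node))
    walk(json.loads(doc) if isinstance(doc, str) else doc)
    return "\n".join(pairs)
```

Every scalar in the document, no matter how deeply nested, ends up attached to its key, which is what makes the result indexable with FULLTEXT.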

Obviously this is useful, because our keys are now attached to our values in an easily searchable way. All you need to do is store the flattened version of the JSON in another field (or another table), and index it with a FULLTEXT index to make it searchable. But wait, there is one more big problem: MySQL will split words on the equal sign. We don't want this as it removes the locality of the keyword and the value. To fix this problem you'll have to undertake the (actually quite easy) step of adding a new collation to MySQL (I called mine ft_kvpair_ci). I added equal (=) to the list of lower case characters as described in the manual. You just have to change two text files, no need to recompile the server or anything, and as I said, it is pretty easy. Let me know if you get stuck on this step and I can show you the 5.6.22 files I modified.
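To see why the collation tweak matters, here is a Python stand-in for the tokenizer behavior (illustrative only; MySQL's actual parser is more involved, but as described above it splits on '='):

```python
import re

def tokens_default(text):
    # Mimics default word splitting: '=' breaks key and value apart.
    return re.findall(r"\w+", text)

def tokens_kvpair(text):
    # Mimics the ft_kvpair_ci idea: '=' is treated as a word character,
    # so the key/value pair survives as a single searchable token.
    return re.findall(r"[\w=]+", text)
```

With the default behavior, searching for `last_name=Vembu` would match any document containing both words anywhere; keeping '=' in the token preserves the key-to-value locality.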

By the way, I used a UDF, because MySQL FULLTEXT indexes don't support pluggable parsers for InnoDB until 5.7. This will be much cleaner in 5.7 with a parser plugin and there will be no need to maintain an extra column.

Using the solution:
Given a table full of complex json:
create table json2(id int auto_increment primary key, doc mediumtext);

Add a column for the index data and FULLTEXT index it:
alter table json2 add flat mediumtext character set latin1 collate ft_kvpair_ci, FULLTEXT(flat);

Then populate the index. Note that you can create a trigger to keep the second column in sync (I leave that as an exercise for the reader), or you can use Flexviews to maintain a copy in a second table automatically.
mysql> update json2 set flat=RAPID_FLATTEN_JSON(doc);
Query OK, 18801 rows affected (26.34 sec)
Rows matched: 18801  Changed: 18801  Warnings: 0

Using the index:
mysql> select count(*) from json2 where match(flat) against ('last_name=Vembu');
| count(*) |
|        3 |
1 row in set (0.00 sec)

The documents I searched for that example are very complex and highly nested. Check out the full matching documents for the query here.

If you want to only index a subportion of the document, use the MySQL UDF JSON_EXTRACT to extract the portion you want to index, and only flatten that.

JSON documents may contain sub-documents as mentioned a moment ago. JSON_EXTRACT can extract a portion of a document, but it is still a text document. There is no function that can extract ALL of a particular key (like invoice_price) and aggregate the results. So, if you have a document called orders which contains a varying number of items and their prices, it is very difficult (if not impossible) to use the JSON UDF to aggregate a "total sales" figure from all the order documents.

To solve this problem, I created another UDF called RAPID_EXTRACT_ALL(json, 'key'). This UDF will extract all the values for the given key. For example, if there are 10 line items with invoice_id: 30, it will extract the value (30 in this case) for each item. This UDF returns each item separated by newline. I created a few stored routines called jsum, jmin, jmax, jcount, and javg. They can process the output of rapid_extract_all and aggregate it. If you want to only RAPID_EXTRACT_ALL from a portion of a document, extract that portion with the MySQL UDF JSON_EXTRACT first, then process that with RAPID_EXTRACT_ALL.
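A Python analogue of the extract-then-aggregate idea (hypothetical; the real RAPID_EXTRACT_ALL is a C++ UDF and jsum is a stored routine):

```python
import json

def extract_all(doc, key):
    # Collect every scalar value stored under `key`, at any depth,
    # returned newline-separated like the UDF output described above.
    found = []
    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key and not isinstance(v, (dict, list)):
                    found.append(str(v))
                walk(v)
        elif isinstance(node, list):
            for item in node:
                walk(item)
    walk(json.loads(doc) if isinstance(doc, str) else doc)
    return "\n".join(found)

def jsum(values):
    # Aggregate the newline-separated list, like the jsum routine.
    return sum(float(v) for v in values.split("\n") if v)
```

So "total sales" over a collection of order documents becomes: extract all the price values per document, sum them per document, then SUM() across rows.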

For example:
mysql> select json_extract_all(doc,'id') ids, jsum(json_extract_all(doc,'id')) from json2 limit 1\G    
*************************** 1. row ***************************
ids: 888

jsum(json_extract_all(doc,'id')): 18660.00000
1 row in set (0.01 sec)

Aggregating all of the id values in the entire collection:
mysql> select sum( jsum(json_extract_all(doc,'id')) ) from json2 ;
| sum( jsum(json_extract_all(doc,'id')) ) |
|                         296615411.00000 |
1 row in set (2.90 sec)

Of course you could extract other fields and sort and group on them.

Where to get the tools:
You can find the UDF in the swanhart-tools github repo. I think you will find these tools very useful in working with JSON documents in MySQL.

Take the long view on the MySQL PERFORMANCE_SCHEMA with ps_history and sys_history

The performance_schema is a powerful tool for analyzing MySQL performance and behavior. One aspect of the performance_schema is that the view of the data is "right now", and very little historical information is present. You can see that there are 10 threads connected right now, but what about five minutes ago?

ps_history is a set of stored routines and events for periodically collecting the data in the performance_schema into another schema called ps_history. The ps_history schema contains a copy of each performance_schema view as a real table, and timestamp and server_id columns have been added to each table. Periodically (by default every 30 seconds) the performance_schema data is written into the history tables.
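The collection pattern is simple to sketch. Here is a Python/SQLite stand-in (table and column names illustrative, not the actual ps_history code) for what a collect() pass does for one table:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
# Stand-in for a performance_schema view:
conn.execute("create table hosts (host text)")
# History copy of the same view, with server_id and ts columns added:
conn.execute("create table hist_hosts (host text, server_id int, ts text)")

def collect(server_id=33):
    # One collection pass: snapshot the current view contents into the
    # history table, stamping every row with the same timestamp.
    ts = datetime.now().isoformat(sep=" ")
    conn.execute(
        "insert into hist_hosts select host, ?, ? from hosts",
        (server_id, ts))
    conn.commit()
```

Run periodically (every 30 seconds by default in ps_history), each pass adds one timestamped snapshot, which is what lets you ask "what did this look like five minutes ago?".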

ps_history comes as one script (setup.sql) which will create the ps_history schema, the tables within it, and create an event for collecting data. To collect data into the history tables you must either enable the event scheduler (set global event_scheduler=1) or you must periodically run the ps_history.collect() stored routine.

ps_history automatically removes old data from the history tables. By default 1 week of history will be retained. You can change the retention period with ps_history.set_retention_period(), for example to keep 2 weeks of history use ps_history.set_retention_period('2 WEEK'); to keep 10 days, ps_history.set_retention_period('10 DAY');

Here is an example from the hosts table:
mysql> select * from ps_history.hosts limit 4\G
*************************** 1. row ***************************
               HOST: localhost
          server_id: 33
                 ts: 2015-02-09 13:17:16.68339
*************************** 2. row ***************************
               HOST: NULL
          server_id: 33
                 ts: 2015-02-09 13:17:16.68339
*************************** 3. row ***************************
               HOST: localhost
          server_id: 33
                 ts: 2015-02-09 13:17:47.69614
*************************** 4. row ***************************
               HOST: NULL
          server_id: 33
                 ts: 2015-02-09 13:17:47.69614
4 rows in set (0.00 sec)

sys_history is a modified version of the SYS schema that uses the data from the ps_history tables instead of the performance_schema. Each of the views in the SYS schema has been modified to include the timestamp (ts) field from the ps_history tables, and joins have been modified to include ts in the join.

Here is an example of one of the aggregated views:
mysql> select * from io_by_thread_by_latency limit 4\G
*************************** 1. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: page_cleaner_thread
         total: 4809257
 total_latency: 00:08:13.93
   min_latency: 66.98 ns
   avg_latency: 1.02 ms
   max_latency: 99.29 ms
     thread_id: 18
processlist_id: NULL
*************************** 2. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 1081585
 total_latency: 00:04:44.69
   min_latency: 432.51 ns
   avg_latency: 263.22 us
   max_latency: 36.52 ms
     thread_id: 8
processlist_id: NULL
*************************** 3. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 412302
 total_latency: 00:01:34.81
   min_latency: 604.82 ns
   avg_latency: 229.94 us
   max_latency: 69.59 ms
     thread_id: 11
processlist_id: NULL
*************************** 4. row ***************************
            ts: 2015-02-09 13:17:16.68339
          user: io_handler_thread
         total: 368534
 total_latency: 00:01:32.18
   min_latency: 589.63 ns
   avg_latency: 250.12 us
   max_latency: 37.70 ms
     thread_id: 10
processlist_id: NULL
4 rows in set (5.91 sec)

Working with comma separated list MySQL options

Over time, some options that are comma-separated lists of flags have crept into the MySQL server. These include SQL_MODE, optimizer_switch, optimizer_trace, and a few other variables.

Optimizer_switch is particularly problematic to work with as it contains many options and is hard to interpret. This is what you usually see when you examine optimizer_switch:


As you can see, seeing which option is on or off is rather difficult. You can use the REPLACE function to make this easier:
mysql> select replace(@@optimizer_switch, ',','\n')\G
*************************** 1. row ***************************
replace(@@optimizer_switch, ',','\n'): index_merge=on
1 row in set (0.00 sec)

This way you can use the \P mysql monitor option to search for key/value pairs:
mysql> \P grep mrr
PAGER set to 'grep mrr'
mysql> select replace(@@optimizer_switch, ',','\n')\G
1 row in set (0.00 sec)
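Outside the mysql client, the same comma-separated string is trivial to split programmatically; a small Python helper (illustrative) that turns it into a dict:

```python
def parse_switch(value):
    # "a=on,b=off,..." -> {"a": "on", "b": "off", ...}
    return dict(item.split("=", 1) for item in value.split(","))
```

Feed it the value of `SELECT @@optimizer_switch` and you can look up any flag by name instead of grepping a wall of text.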

The hex datatype?

Anders posted a blog post about adding a new JSON data type to MariaDB. This made me remember my very first MySQL User Conference; well, that and the next couple of conferences, actually. Why? Because at every conference (for a while) I remember Brian Aker getting everybody into a room to discuss how to add a data type to MySQL. It was considered pretty much the quintessential example of how to start developing for MySQL, because it touches a lot of the basic guts of the server.

The example that Brian always gave was a "hex" data type, so that you can store MD5 numbers using 128 bits of storage instead of 256. A human readable MD5 is a 256 bit string (32 characters) representing 128 bits of actual data. So storing MD5 values in the database (a very common thing) requires twice as much space as it should.

Now, the question I've always asked is, "Why did nobody actually implement the type?" It seems really useful. And it was always said to be trivial. So why doesn't it exist?

Perhaps because you can use a BINARY(16) column and store the raw value, converting between hex and binary? Or maybe just use a UDF for storage and retrieval?
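The BINARY(16) workaround is easy to show in Python terms (just to make the space math concrete):

```python
import hashlib

def md5_to_binary(hex_digest):
    # 32 hex characters -> 16 raw bytes, i.e. what BINARY(16) stores
    return bytes.fromhex(hex_digest)

def binary_to_md5(raw):
    # back to the human-readable 32-character hex form
    return raw.hex()

digest = hashlib.md5(b"hello").hexdigest()
assert len(digest) == 32 and len(md5_to_binary(digest)) == 16
```

In MySQL the same conversion is UNHEX() on the way in and HEX() on the way out, which is exactly the manual bookkeeping a native hex type would eliminate.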

Should a hex data type get added after all this time? How about a UUID data type? Especially with UUID being common, it could save a lot of space.

Archive vs Percona XtraDB vs Tokutek TokuDB LOAD DATA performance

Stewart blogged about the archive storage engine and he asked the Internet to test how fast ARCHIVE is compared with other MySQL storage engines like Percona XtraDB and Tokutek TokuDB. Since I've been curious about this subject as well, I decided to give it a quick test.

Not very compressible data

I took a star schema benchmark "lineorder" file and grabbed the first 15M rows. To each row I added a TEXT field called "extra_data". This field contains 10 random words from /usr/share/dict/words separated by space. This adds up to just about 3GB of raw input data.

Insert performance with no indexes (best to worst)

TokuDB: 187K rows/sec
Query OK, 15000000 rows affected (1 min 20.25 sec)

XtraDB (uncompressed): 119K rows/sec:
Query OK, 15000000 rows affected (2 min 5.40 sec)

Archive: 61K rows/sec:
Query OK, 15000000 rows affected (4 min 3.15 sec)

XtraDB Compressed key_block_size=8: 6K row/sec:
I cancelled after 6 minutes; at that point 2.2M of the 15M rows had been loaded.
---TRANSACTION 514, ACTIVE 365 sec inserting
1 lock struct(s), heap size 376, 0 row lock(s), undo log entries 2233162

Test with indexes

I added the following keys to the table:
alter table insert_test add key(LO_CustKey), add key(LO_OrderDateKey),
add key(LO_SuppKey), add key(LO_PartKey);

I then loaded the table as before to see how multiple indexes affect insert performance.

TokuDB: 84K rows/sec:
Query OK, 15000000 rows affected (2 min 58.42 sec)

InnoDB: 53K rows/sec:
Query OK, 15000000 rows affected (4 min 40.05 sec)

Insert performance with more highly compressible data

This time I took the first 5M rows, but instead of just adding 10 random words, I added the words in a repeating "phrase". For example: "word1 word2 ... word10 word1 word2 ... word10 word1 ...". This generated about the same amount of data.
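For reproduction, the two extra_data variants could be generated roughly like this (a sketch; the real word source was /usr/share/dict/words, stubbed here with a small list):

```python
import random

# Stand-in for /usr/share/dict/words (any word list works)
WORDS = ["apple", "banana", "cherry", "date", "elder",
         "fig", "grape", "honey", "iris", "jade"]

def random_words(n=10):
    # Low-compressibility variant: n random words per row
    return " ".join(random.choice(WORDS) for _ in range(n))

def repeating_phrase(phrase_words, repeats):
    # High-compressibility variant: the same phrase repeated over and over
    phrase = " ".join(phrase_words)
    return " ".join([phrase] * repeats)
```

The repeated-phrase rows give the compressing engines (Archive, compressed XtraDB, TokuDB) far more redundancy to work with than independent random words do.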

TokuDB (no indexes): 92K rows/sec:
Query OK, 5000000 rows affected (45.34 sec)

TokuDB (with indexes): 66K rows/sec:
Query OK, 5000000 rows affected (1 min 14.89 sec)

XtraDB (no indexes): 65K rows/sec:
Query OK, 5000000 rows affected (1 min 16.75 sec)

Archive: 43K rows/sec:
Query OK, 5000000 rows affected (1 min 55.24 sec)

XtraDB (with indexes): 36.6K rows/sec:
Query OK, 5000000 rows affected (2 min 16.23 sec)

XtraDB Compressed key_block_size=8 (no indexes): 19.25K rows/sec:
Query OK, 5000000 rows affected (4 min 20.12 sec)

XtraDB Compressed key_block_size=8 (with indexes): 7.59K rows/sec:
Query OK, 5000000 rows affected (10 min 58.43 sec)

Building index on copy of the 15M row table (best to worst)

alter table sortbuild_test  add key(LO_CustKey), add key(LO_OrderDateKey),  add key(LO_SuppKey), add key(LO_PartKey);

[edit: Fixed order to put TokuDB first]
TokuDB:
Query OK, 0 rows affected (1 min 40.55 sec)

InnoDB:
Query OK, 0 rows affected (2 min 21.93 sec)

Building index on copy of the 5M row table (InnoDB 8k only)

XtraDB Compressed key_block_size=8:
Query OK, 0 rows affected (1 min 14.05 sec)

Data size comparison (smallest disk footprint to largest)

These numbers were calculated using INFORMATION_SCHEMA.TABLES which reports the UNCOMPRESSED size for TokuDB tables. I hope they fix this in a new release. See the comments for a look at the tokudb data directory contents.

| table_schema                 | table_name        | data_mb | idx_mb  | sum_size |
| archive_phrase               | insert_test       |  529.78 |    0.00 |   529.78 |
| archive                      | insert_test       | 1386.51 |    0.00 |  1386.51 |
| innodb_compressed_phrase     | insert_test_8k    | 1601.00 |    0.00 |  1601.00 |
| innodb_compressed_phrase_idx | sortbuild_test_8k | 1601.00 |  151.30 |  1752.30 |
| innodb_compressed_phrase_idx | insert_test_8k    | 1601.00 |  433.83 |  2034.83 |
| tokudb                       | insert_test       |    X    |    X    |     X    |
| tokudb_phrase                | insert_test       |    X    |    X    |     X    |
| tokudb_phrase_idx            | insert_test       |    X    |    X    |     X    |
| innodb                       | insert_test       | 3112.00 |    0.00 |  3112.00 |
| innodb_phrase                | insert_test       | 3202.00 |    0.00 |  3202.00 |
| tokudb_idx                   | insert_test       |    X    |    X    |     X    |
| tokudb_idx                   | sortbuild_test    |    X    |    X    |     X    |
| innodb_phrase_idx            | insert_test       | 3202.00 |  565.03 |  3767.03 |
| innodb_idx                   | sortbuild_test    | 3112.00 |  899.19 |  4011.19 |
| innodb_idx                   | insert_test       | 3112.00 | 1519.00 |  4631.00 |

The TokuDB test table is 1800MB (15M row table, 3G orig size)
The TokuDB test2 table is 2171MB (keys built with ALTER)
The tokudb_idx database is 2202MB (keys built with INSERT)
The tokudb_phrase database is 681MB (5M row, 3G orig size)
The tokudb_phrase_idx database is 806MB (5M row, keys built with INSERT)

Here is the CREATE TABLE for the data:
create table insert_test (
  LO_OrderKey bigint not null,
  LO_LineNumber tinyint not null,
  LO_CustKey int not null,
  LO_PartKey int not null,
  LO_SuppKey int not null,
  LO_OrderDateKey int not null,
  LO_OrderPriority varchar(15),
  LO_ShipPriority char(1),
  LO_Quantity tinyint,
  LO_ExtendedPrice decimal,
  LO_OrdTotalPrice decimal,
  LO_Discount decimal,
  LO_Revenue decimal,
  LO_SupplyCost decimal,
  LO_Tax tinyint,
  LO_CommitDateKey int not null,
  LO_ShipMode varchar(10),
  extra_data TEXT
);

For compression:

Here is the my.cnf for Percona XtraDB (not tuning for tokudb):
# Disabling symbolic-links is recommended to prevent assorted security risks




Test Hardware:
Intel Core i7-970 3.2GHz, 6 cores w/HT (12 threads)
LSI 9211-8i IT HBA
DB on software RAID 0 - two Intel 520 SSDs and one OCZ Vertex.  300GB total.

Some LSI 9211-8i issues on Windows and Linux

Make sure you flash an LSI-9211 to IT firmware rev#14 to get it to work 
with Linux and SSD trim.  You may have to downgrade from newer firmware
to older firmware to get the card to work.


Finding a SATA III controller with more than one PCI-e lane

After a recent hardware issue I decided to upgrade my computer to use new Intel 520 120GB SSD drives in RAID for improved performance.  The motherboard I use (an ASUS Rampage III Extreme) has a Marvell SATA III controller with two ports, but I discovered that it is connected via only a single PCI-e lane (each lane can do at most 400MB/sec*).  This means it can't effectively support even a single Intel 520, because one device can saturate the SATA III bus (an Intel 520 is rated at up to 550MB/sec sequential write).

So I went on a quest for a new SATA 3 controller.   To Frys! I exclaimed.  But unfortunately, all the PCI-e 2.x SATA III controllers used a single lane!  (These cards also feature RAID which is a double wtf).  

Well, having been thwarted by my Frys run (I could have upgraded the motherboard, but I want to wait for the next generation of processors with DDR4 ram) I went to Central Computers in Mountain View.  There I found LSI PCI-e HBAs. LSI HBAs use 8 PCI-e lanes.  I decided to get an LSI 9211-8i for a little less than they sell on NewEgg.  This is a RAID controller, and I specifically told the technician that I wanted a RAID controller that supported TRIM and he said LSI supports trim**.  So I laid down my money and headed home.

Strange Windows 7 Professional 64 bit performance

I booted into Windows 7 and performance on AS-SSD (a simple benchmark utility targeted at SSD performance) for the drive was horrible.  So horrible the test could not complete. During the test the performance is reported to be pegged at 98 IOPS.  To compare, the same test, same drive, same OS, but different card: 13000 IOPS.  13K IOPS is low for the drive but it was an old sata 2 card.  Regardless, it shows the drive can do way better than the 98 IOPS that the LSI card reported while attempting to complete the test.  I messed around with this for quite awhile, pulling out hair.  I eventually tried upgrading the card to R16 firmware.  Same issue.  I figured that maybe there was some sort of windows weirdness so I decided to test Linux. I have contacted LSI about the issue with AS-SSD to see if there is possibly a driver issue to blame.

Linux and the case of the mysterious hanging mkfs.ext4

I installed CentOS 6.4 minimal on another 32GB OCZ SSD I had lying around.  CentOS found the LSI card just fine.  I could run a `dd` on the Intel 520 drives and they could do 500MB/sec+ sequential write.  Great.  Next I tried to format a filesystem on the device using `mkfs.ext4`.   ext4 now supports TRIM and will discard all of the blocks on a target when it creates a new filesystem. However, when I tried to create a filesystem with `mkfs.ext4`, it would simply hang at 0/X bytes discarded. Eventually the kernel started printing messages like "attempting task abort!" and "task abort: SUCCESS scmd" into the log.  

There was no way to kill `mkfs.ext4`.

I did some searching and found an old post to the OCZ forum about problems with an OCZ Vertex 3, which is sandforce based, same as the Intel 520.  When I was doing my tests, my Vertex 3 was on another SATA bus, isolated from the LSI so I think this may be SANDFORCE related.

If you didn't read the post, that user upgraded to firmware R14 to fix the problem. Unfortunately I had already upgraded to R16.  

Fixing the mkfs.ext4 issue

I decided to try to downflash to R14.  But of course, the flash utility won't let you downgrade by default.  However, I found a workaround by using a VERY DANGEROUS mode on the controller: RESET MODE. In order to use this function you must use the DOS flash method. I created a USB FreeDOS boot disk. The flash method may work from UEFI but UEFI is a PITA. It doesn't work from the Linux flash utility or the Windows flash utility.

To place the card in RESET MODE:
C:\>sas2flsh.exe -o -e 6

Now you must flash the new firmware AND bios:
C:\>sas2flsh.exe -o -f FIRMWARE.bin -b BIOS.rom 

It is safe to reboot after flashing.

With the LSI card flashed down to R14 I was able to create a filesystem on the device, but TRIM was not working.  /sys/block/sda/queue/discard_max_bytes was 0.

It turns out only LSI HBA initiator devices support trim.  So you can't use RAID and TRIM on this card or even TRIM on non-RAID devices (JBOD).  Sales guy sold me another lemon.  Oh well, software raid 0 isn't a large overhead.

Getting rid of raid (flash to IT mode)

Oh bugger, sas2flsh won't let you switch from IR (RAID) to IT (INITIATOR) firmware either without going into RESET mode.  Follow the same instructions above, but RESET the card then flash the IT firmware instead of IR.  The BIOS is the same in either case.

Final setup

I'm doing software RAID1 over a portion of the two Intel, swap on a portion of the OCZ, and software RAID0 over the remainder of all three.  This gives me a safe area for my OS and important files, a large high performance database test area, and the swap on the OCZ will be regarded as unused by TRIM.  This will improve garbage collection on the OCZ which was in use for some time before I bought the 520s.

* You will see it quoted as 5GT/s and/or 5Gbit/sec but that is before 8b/10b encoding is applied for the data transfer over the bus

** Turns out they only support trim for HBA Initiators, not HBA RAID. You can flash to IT firmware though. See the post.

*** This post isn't MySQL specific.

Talks I would recommend at the Percona Live MySQL Conference and Expo

Time for some shameful self promotion and hopefully some helpful information too.

What follows are my lists of talks at the MySQL conference that I would suggest to others, based on experience level and/or interests.  I've focused on my own interest areas, as others have already posted lists with theirs, but I also provide a list for those who may be new to the event and want suggestions for what I think will be great all-around talks with a wide focus.

If you are new to MySQL, or new to PCLME:
Day 1:

If you are experienced with MySQL and want to highlight interesting and new stuff:

Day 3:

If you are into sharding/scaling out:

If you are into high performance OLAP/analytics processing and/or "big data":