swanhart

The WARP storage engine beta: columnar storage for MySQL 8 with automatic bitmap indexing

Oracle MySQL is in need of a columnar storage engine for analytics workloads.  A columnar engine (or column store) stores data vertically, that is, it stores all the data associated with a column together, instead of the traditional RDBMS storage method of storing entire rows together, either in a index organized manner, like InnoDB, or in a heap, like MyISAM.  

Columnar storage has the benefit of reducing IO when only a subset of the row is accessed in a query, because only the data for the accessed rows must be read from disk (or cache) instead of having to read entire rows.  Most columnar stores do not support indexes, but WARP does.

WARP is open source

You can find the WARP source code release on GitHub.  Binaries can be provided upon request.  Simply open an issue for your desired Linux distribution, and I will make them available as soon as I can.

WARP is beta quality

WARP is based on Fastbit, which is currently version 2, and is used in production in a number of large scientific applications, such as grid computing analysis of particle accelerator data, working with genomic data, and other applications.  

WARP has been tested by a variety of alpha users.  It is likely that there are still bugs or missing features in the MySQL storage engine interface, thus it is not suggested to use WARP for production critical data.  It is suggested to test WARP against the same data in another storage engine to test for correctness.  

Bugs and feature requests can be reported on the GitHub issues page, at the GitHub link provided above.  

Support and consulting for WARP implementations is available through Swanhart Technical Services, as well as generic MySQL training and consulting services.  I will provide information about those options in another blog post.

Bitmap Indexing

While columnar storage is uncommon to open source SQL RDBMs, bitmap indexing is not available at all.  Bitmap indexes have characteristics that make them ideal for queries that traditional btree indexes can not answer efficiently (or at all), but they are not sorted, so they do not provide all of the same capabilities of btree indexes, such as the ability to provide pre-calculated sorting.  

WARP provides both columnar storage and automatic bitmap indexing of columns used in filters.  The end user doesn't have to pick which specific columns to index.  Compressed bitmap indexes are automatically created to support the queries run against the database.  It is possible to exclude columns from automatic indexing if desired.  

The WARP storage engine

WARP SE is an acronym for Word Aligned Relational Partitioned Storage Engine.  WARP is a storage engine for MySQL 8.0, currently supporting MySQL 8.0.19.  Data is partitioned vertically, and horizontally, automatically.  

WARP does not currently support distributing data over more than one node.  WARP FS is a distributed file system in development to support this feature, but it is not currently available for public testing (yet).

The WARP engine is based on an existing "nosql" database called Fastbit (https://sdm.lbl.gov/fastbit/).  Fastbit is an open source columnar engine which features WAH (word aligned hybrid) compressed bitmap indexes.  WARP is based on a slightly modified version of the Fastbit 2.0.3 code base.  When you compile WARP, the Fastbit source code is automatically retrieved, patched, and compiled, as part of the build process.

Technically pluggable

The WARP engine does not require any modification to the MySQL source code, thus it is technically a pluggable engine, but the current source tree compiles it into the server like the InnoDB storage engine.  This is because MySQL currently changes the storage engine interface in minor releases, so older versions of plugins, even for the same major release, may no longer work in later point releases.  This is unfortunate, but that is a topic for another blog post.

WARP storage engine in detail

For my examples, I am going to use the Star Schema Benchmark test data.  The Star Schema Benchmark test data creates a STAR schema.  STAR schema are very common in data marts.  The fact table is usually a very large table, and dimension tables are smaller "lookup" tables that are joined to the fact table.

The following table is the "fact" table:

The DDL associated with the LINEORDER table of the Star Schema Benchmark
The DDL associated with the LINEORDER table of the Star Schema Benchmark

A note about indexes

You do not need to define indexes ahead of time

You will note that the table does not have any indexes defined on it.  Indexes are automatically created and maintained transparently by WARP when a column is utilized in a filter (WHERE clause).  

It is, however, possible to explicitly define indexes for columns, which will then allow an EXPLAIN plan to show index usage.  Keep in mind, however, that WARP may use indexes (even combining multiple indexes) in ways MySQL is not capable of doing so, so the EXPLAIN plan does not always accurately represent the execution plan of the WARP storage engine.  It is possible to exclude columns from being indexed automatically, but that will be discussed in a future blog post.

You do not need indexes for joins in MySQL 8

MySQL 8.0.19+ supports HASH joins, which are efficient for joining tables, and counter-intuitively, using bitmap indexes for joins may result in slower queries than using HASH joins, so it is not recommended to explicitly create indexes on columns that are going to be used in joins.  Doing so will cause MySQL to choose to use nested loop joins instead of HASH joins.  

If it is desirable to have a UNIQUE key in a dimension table, it is suggested to store the column twice, once without KEY and once with, and to join on the column without the key. This limitation may change in a future version of WARP.

Updates are not "in place"

WARP does not update column data in place.  When a row is deleted, the deleted RID is marked in the sparse bitmap (described below).  When a row is updated, the original version of the row is marked as deleted, and a new row is appended to the table.  This means that the old version of a row may be in a different partition than the new version.  Rebuilding a table with OPTIMIZE TABLE or ALTER TABLE .. FORCE will remove the deleted rows.  

The primary reason for this is that WAH compressed bitmaps are extremely expensive to update, but disk space is relatively cheap.  The second reason is that strings are stored adjacent to each other in a column, and increasing the length of a string would require relocating the whole row anyway.  

Filling in "holes" is fairly simple in a row store, but much more complicated in a column store.

WARP RIDs (row ids)

RID stands for "rowid".  Every row in a WARP table has an associated RID.  Note that RID values are not monotonic.  A table may contain gaps in RID values, thus a WARP rid is a logical row address.  RID values are 64 bit integers.

The physical offset into a column in a partition is the physical RID of the row in that partition.  But since a table may have many partitions, it is not possible to refer to a physical RID from the storage layer.  If a table is modified or rebuilt with ALTER TABLE, the deleted rows are removed, and physical RID values will change, but logical RID values will not.

RID values are not visible or usable from the MySQL storage layer.  They are only used internally by the storage engine.

Data in the data directory

MySQL stores data on disk in a "data directory".  Each schema (aka database) is a directory in the data directory.  In this case, the tables for the ssb_warp schema are located in /data/mysql_data/ssb_warp.

The .sdi file is conceptually similar to the .FRM file that would be found in older versions of MySQL, prior to MySQL 8.  MySQL 8 has a transactional data dictionary, and has eliminated .FRM files.  

Files and directories inside of the table directory

The table data itself is stored on disk in a table directory associated with the table.  In this case the data for the "lineorder" table is located in /data/mysql_data/ssb_warp/lineorder.data

Inside of the datafile directory, you will find additional directories which represent each partition of the table.  WARP automatically partitions tables, by default, into horizontal partitions of a maximum size of one million rows.  This is currently configurable on a per-instance, not per-table basis.  

WARP generally reads an entire column in a partition into memory at once.  Every non-empty table has at minimum, one partition.

In addition to the partitions, which are directories  you will find a "-part.txt" file and a "deleted.rids" file.  

-part.txt

The "-part.txt" file is a .ini like file which contains information about the table, including the data types for each column.  You must not delete or modify this file manually.  

The "-part.txt" file also includes the index specifications for bitmap indexes.  WARP supports a wide variety of different bitmap indexing methods.  The default index method is suitable for a wide variety of data, but it is possible to choose the bitmap index characteristics via a comment on a column in the table DDL.  This will be explored in a future blog post.

deleted.rids sparse bitmap file

The "deleted.rids" file is a simple sparse bitmap which indicates which rows in the table have been deleted.  Simple sparse bitmaps are stored on disk, and can represent a bitmap of any size (up to the maximum file size on the filesytem) and do not need to be read completely into memory for querying or modification.

Each RID is represented by one bit in the sparse bitmap.  This bitmap is not WAH compressed because updating WAH bitmaps is very expensive.  Instead it is a sparse file which means the OS will only use space in the bitmap for any blocks in which a deleted RID is written.  Updates to the sparse bitmap are logged with a WAL (write ahead log) and are atomic.

If you have a 1 billion row table with only one deleted row, the "deleted.rid" file will appear to be large, but will in fact only take up 4K (or your filesystem block size) of space on disk.  Take care when backing up and restoring data, to preserve sparse files.

The sparse bitmap header-only C++ library is available for use in other projects here.

Column data files

WARP stores each column in an individual data file.  The data in the column is analogous to a raw C array.  On disk, each column data file is named after the ordinal position of the column in the table, staring with zero, and prefixed by the letter c.  Thus, a 17 column table will have sixteen data files c0 through c16.

Files in a partition
Files in a partition

The "r" file is the column which contains the WARP RID.  This column is fetched with every query, and each row RID is compared against the deleted rid simple sparse bitmap and skipped if the row has been marked as deleted.

NULL marker data files

NULL markers are stored in data files named nX, where X is an integer representing the ordinal position of the column.

Because a SQL RDBMS supports the concept of NULL values, when a column is NULLABLE, the NULL marker is created to store a 1 if the column is null, or a 0 if it is not.  When a column value is NULL, a 0 is stored in column data file, and a 1 is stored in the "null marker".  

The NULL marker is essentially a hidden TINYINT column, thus NULL values take up the storage size of the column, plus one extra byte.  This is necessary because Fastbit does not support NULL values in the traditional RDBMS sense.  

The IS NULL and IS NOT NULL operators query the NULL markers, and bitmap indexes are automatically created for the markers to support these operations when necessary.

Index files

When WARP creates indexes, these files are stored in additional files, depending on if the data is character or integer based.  For integer data, there will be one ".idx" file, named cX.idx where the X is again the ordinal position of the column in the table.  For data stored as strings, "cX.sp" files are created, again named after the ordinal position of the column.  It is possible to delete ".idx" or ".sp" files.  They will be recreated automatically as necessary.

Locking and ACID compliance

WARP currently uses table level locking, and supports concurrent inserts, and querying data while loading.  WARP is partially ACID compliant.  Loading data commits periodically, and committed data is visible to new queries.  WARP does not support transactions (yet).  

It is recommended to add an integer column to tables representing a batch of data, and if the data loading fails, or is interrupted, to delete the affected batch.  The next version of WARP will support atomic loading so that this will no longer be necessary.  

DELETE and UPDATE queries block other queries.  DELETE and UPDATE queries are uncommon in analytics systems.  WARP does support DELETE and UPDATE so that "slowly changing dimensions" are supported, which is not possible on column stores like ClickHouse.

DELETE and UPDATE operations are not in place.  Old row versions are maintained in the table until it is rebuilt with OPTIMIZE TABLE.Thus WARP is somewhere between MyISAM and InnoDB in terms of ACID compliance, and the goal is full ACID compliance in a future release. 

A quick look at performance and automatic indexing.

The lineorder table referenced above has ~70M rows:

A COUNT(*) query with no WHERE clause
A COUNT(*) query with no WHERE clause

The LO_OrderKey and LO_SuppKey columns are integers.  Each order has on average 7 line items, but the number of lines differ (the SSB data is based on the TPC-H data generator).

WARP reads only the necessary column data from disk

You will notice that the following query is a lot faster than the COUNT(*) query.  This is because only the LO_OrderKey column had to be read from disk.  This is because WARP is a column store.  In addition, an index was automatically constructed on the LO_OrderKey column.

A filter is used on the LO_OrderKey column which constructs an index
A filter is used on the LO_OrderKey column which constructs an index

Automatic indexing for filtered columns

The first time you query a column and use a filter (WHERE clause) on it, IO will be done to read the data from disk, and an index will be automatically constructed.  Here is a filter for a single order:

A 39K index is created for the 8.0 megabytes of data in partition 0
A 39K index is created for the 8.0 megabytes of data in partition 0

You will notice that the index is small.  WARP uses WAH compressed bitmaps. 

This is the first partition in the table, but almost ever partition (save the last) contains the same amount of data because each partition has 1M rows.  Thus the size of the LO_OrderKey on disk is ~544MB, but the size of the index is only 4.67MB.  

A btree index in InnoDB requires the whole column to be copied (and sorted) along with the primary key for the row.  So an InnoDB index on this same data would be ~577MB.

Read free index updates

When a row is appended to a table only the last block of the index has to be updated, thus inserting new rows into the table does not incur the performance penalty of maintaining btree indexes.  Indexes use very little space, and are negligible to insertion and modification performance.

Querying the column again with the automatic index

The bitmap index will be used the next time a filter is applied to the LO_OrderKey column.  This query selects a different order from the table:

Searching for LO_OrderKey = 100 uses the indexes with a 0.03 second response
Searching for LO_OrderKey = 100 uses the indexes with a 0.03 second response

WARP didn't have to do any IO for the query, because the data is already cached from the previous query.  The size of the cache is configurable and defaults to 4GB.  But more importantly, the bitmap index was utilized, and instead of scanning the whole table (which takes 55 seconds) the query is resolved in 0.03 seconds.

Unlike most MySQL engines, WARP can use more than one index to answer a query

Bitmap indexes support bitmap intersection for fast query results that can not be satisfied with btree indexes.  MySQL does support "index intersect" operations in some cases, but they are still not as fast as bitmap index intersection, and every btree index slows down data manipulation operations significantly.

I first query the LO_SuppKey column to create an index on it. Note, I could just run the final query that references both columns, but again, I want to demonstrate the indexing performance and show that the intersection returns the right results.

It takes 26.34 seconds to read the column from disk and construct the index
It takes 26.34 seconds to read the column from disk and construct the index

I will use a disjunction (an OR clause) to combine results from both of the indexes. 

There are no rows with both LO_OrderKey = 100 and LO_SuppKey = 9988, thus the query result should include the 2974 rows where LO_SuppKey = 9988, and the five rows where LO_OrderKey = 100 rows, and indeed it does:

Bitmap intersection returns results in 0.05 seconds for an OR query that btree indexes can not efficiently resolve
Bitmap intersection returns results in 0.05 seconds for an OR query that btree indexes can not efficiently resolve

Conclusion

This concludes the introduction to the WARP storage engine.  Questions and comments are welcome.  Again, if you would like binaries for your preferred Linux distribution, open a GitHub issue requesting them, and I will make them available.


Error

default userpic

Your reply will be screened

Your IP address will be recorded 

When you submit the form an invisible reCAPTCHA check will be performed.
You must follow the Privacy Policy and Google Terms of use.