My SQL Dump

MySQL musings by a self professed MySQL Geek

Previous Entry Share Next Entry
A new big data structure for streaming counters - bit length encoding
swanhart
One of the challenges of big data is that it is, well, big. Computers are optimized for math on 64 bits or less. Any bigger, and extra steps have to be taken to work with the data which is very expensive. This is why a BIGINT is 64 bits.  In MySQL DECIMAL can store more than 64 bits of data using fixed precision.  Large numbers can use FLOAT or DECIMAL but those data types are lossy.

DECIMAL is an expensive encoding. Fixed precision math is expensive and you eventually run out of precision at which point you can't store any more data, right?

What happens when you want to store a counter that is bigger than the maximum DECIMAL?  FLOAT is lossy.  What if you need an /exact/ count of a very big number without using very much space?

I've developed an encoding method that allows you to store very large counters in a very small amount of space. It takes advantage of the fact that counters are always increasing.  We can take advantage of the fact that the counter always increases by also choosing an arbitrary bit size in which to do math.  If this sounds confusing, the example below should be simple to understand.

Decide on a large counter (total number of bits ever downloaded from wikipedia for example).  There are two goals that have to be met if the counter is to be manageable.

  1. The data must be stored in a compact way so that it doesn't take a long time to read the counter value or write the counter value

  2. The data must be stored in such a way that a counter with a million digits does not have to print 1 million digits on the screen, but that can instead use an alternate notation.

  3. You see this with floating point operations all the time, which remove decimal places by using powers of 10.

  4. That idea can be extended to any counter using POWERS OF TWO notation without ever losing precision.

  5. The data must be stored in a way that allows further computation by the machine in the new representation.

Storing the data in the database:


  1. Define a counter.  We'll call it bits_on_the_internet.

  2. In the database use two columns.  One is an INTcolumn called COUNTER_DATA and the other a DECIMAL(60,0) column called COUNTER_DATA_OVERFLOWS.

  3. Use 64 bit math to add the size (in bits) of each document to the counter and store in memory as NEW_COUNTER_DATA.

  4. The new value stored in 64 bits may overflow 32 bits

  5. If so, then:

  6. while(NEW_COUNTER_DATA > 2^32) {


    1. Increment COUNTER_DATA_OVERFLOWS by one.

    2. Subtract 2^32 bits from the overflowed value

    3. Write the difference to COUNTER_DATA


    }

  7. If there was no overflow you don't need to do anything. Just store the new COUNTER_DATA.


Decoding the stored data:

  1. Read COUNTER_DATA

  2. Read COUNTER_DATA_OVERFLOWS

  3. Print the following on the screen: COUNTER_DATA + ((2^32) * COUNTER_DATA_OVERFLOWS))

  4. The computer can still do math on the encoded format. I just showed you how to add to the encoded format above.

  5. Using this method VERY large counters can be stored efficiently.

  6. If one DECIMAL(65) of overflows is not enough, add another one COUNTER_DATA_OVERFLOWS2 which represents the number of 2^33 bits that were taken away. In other words overflow COUNTER_DATA_OVERFLOWS into COUNTER_DATA_OVERFLOWS2.


I call this "bit-length encoding".

I also have a working non-database example written in PHP here:
https://www.facebook.com/notes/justin-swanhart/php-script-to-bit-length-encode-streaming-counters/10151445872671472

Why it works:
Algebra says that if you take something away from an equation on one side, you have to add it back on the other.  This method just keeps track of how many times a constant number has been taken away.  To un-encode the number you just need to multiply by the constant N times.  The method implemented is called "bag algebra".

The most important thing about this counter is that it is easily shows human readable numbers without really converting from machine representation.  That is, you can work with this "compressed" (encoded really) data directly, even adding or subtracting from it as long as you use the rules above.   You can do this without having to decompress it into as many bits as needed for the two's complement notation, so very large numbers don't need a long time to print or take a lot of space when they are transmitted over the network for aggregation.

This method easily allows the counter to be stored on more than one machine.  Just add the counters from each machine using 64 bit math, add the overflow counters, and then print the total representation, thus counter calculation is embarassingly parallel as long as the function providing the counter.


You are viewing swanhart