One of the key aspects of using HP Vertica is optimizing data storage with encoding. Encoding reduces disk space usage and speeds up retrieval of large data volumes thanks to on-the-fly decoding.
The optimal encoding type can be chosen based on the data type, its cardinality, and the CPU intensity of the encoding algorithm.
In this article I investigate disk storage consumption for the timestamp type depending on precision and encoding type.
A text file was generated for the tests. It contains 43,195,502 lines of timestamps with 6-digit precision, covering a single day. Each millisecond produces a record with 50% probability, which gives about 500 lines per second on average.
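The generator itself is not shown in the article; a minimal Python sketch that produces a file in the same format (the `generate` helper, file name, and random seed are my assumptions) could look like this:

```python
import random

random.seed(42)  # reproducible output; seed choice is arbitrary

def generate(path, day="18.11.15", n_ms=24 * 60 * 60 * 1000):
    """Hypothetical generator: one candidate record per millisecond of the day,
    kept with 50% probability; microseconds are random within the millisecond."""
    count = 0
    with open(path, "w") as f:
        for ms in range(n_ms):
            if random.random() < 0.5:          # keep ~50% of milliseconds
                s, msec = divmod(ms, 1000)
                h, rem = divmod(s, 3600)
                m, sec = divmod(rem, 60)
                us = msec * 1000 + random.randrange(1000)
                f.write(f'"{day} {h:02d}:{m:02d}:{sec:02d},{us:06d}"\n')
                count += 1
    return count
```

Running it for a full day yields roughly 43.2 million lines (86,400,000 ms × 0.5), matching the row count above.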
"18.11.15 00:00:00,001261" "18.11.15 00:00:00,002797" "18.11.15 00:00:00,004778" "18.11.15 00:00:00,005283" ... "18.11.15 23:59:59,995350" "18.11.15 23:59:59,999563"
A set of tables with different precisions and encoding types was created as well.
create table test_ts_AUTO(t timestamp encoding AUTO) order by t;
create table test_ts_BLOCK_DICT(t timestamp encoding BLOCK_DICT) order by t;
create table test_ts_BLOCKDICT_COMP(t timestamp encoding BLOCKDICT_COMP) order by t;
create table test_ts_BZIP_COMP(t timestamp encoding BZIP_COMP) order by t;
create table test_ts_COMMONDELTA_COMP(t timestamp encoding COMMONDELTA_COMP) order by t;
create table test_ts_DELTARANGE_COMP(t timestamp encoding DELTARANGE_COMP) order by t;
create table test_ts_DELTAVAL(t timestamp encoding DELTAVAL) order by t;
create table test_ts_GCDDELTA(t timestamp encoding GCDDELTA) order by t;
create table test_ts_GZIP_COMP(t timestamp encoding GZIP_COMP) order by t;
create table test_ts_RLE(t timestamp encoding RLE) order by t;
create table test_ts3_AUTO(t timestamp(3) encoding AUTO) order by t;
create table test_ts3_BLOCK_DICT(t timestamp(3) encoding BLOCK_DICT) order by t;
create table test_ts3_BLOCKDICT_COMP(t timestamp(3) encoding BLOCKDICT_COMP) order by t;
create table test_ts3_BZIP_COMP(t timestamp(3) encoding BZIP_COMP) order by t;
create table test_ts3_COMMONDELTA_COMP(t timestamp(3) encoding COMMONDELTA_COMP) order by t;
create table test_ts3_DELTARANGE_COMP(t timestamp(3) encoding DELTARANGE_COMP) order by t;
create table test_ts3_DELTAVAL(t timestamp(3) encoding DELTAVAL) order by t;
create table test_ts3_GCDDELTA(t timestamp(3) encoding GCDDELTA) order by t;
create table test_ts3_GZIP_COMP(t timestamp(3) encoding GZIP_COMP) order by t;
create table test_ts3_RLE(t timestamp(3) encoding RLE) order by t;
create table test_ts0_AUTO(t timestamp(0) encoding AUTO) order by t;
create table test_ts0_BLOCK_DICT(t timestamp(0) encoding BLOCK_DICT) order by t;
create table test_ts0_BLOCKDICT_COMP(t timestamp(0) encoding BLOCKDICT_COMP) order by t;
create table test_ts0_BZIP_COMP(t timestamp(0) encoding BZIP_COMP) order by t;
create table test_ts0_COMMONDELTA_COMP(t timestamp(0) encoding COMMONDELTA_COMP) order by t;
create table test_ts0_DELTARANGE_COMP(t timestamp(0) encoding DELTARANGE_COMP) order by t;
create table test_ts0_DELTAVAL(t timestamp(0) encoding DELTAVAL) order by t;
create table test_ts0_GCDDELTA(t timestamp(0) encoding GCDDELTA) order by t;
create table test_ts0_GZIP_COMP(t timestamp(0) encoding GZIP_COMP) order by t;
create table test_ts0_RLE(t timestamp(0) encoding RLE) order by t;
Data was bulk-inserted from an external file-based table.
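The definition of the `original` external table is not shown in the article; in Vertica it could be declared roughly like this (the file path is an assumption, and the DD.MM.YY order with a comma as the fractional-seconds separator may additionally require session date/time settings or a per-column FORMAT mask):

```sql
-- Hypothetical definition of the external table "original";
-- the path and parsing details are assumptions, not from the article.
create external table original(t timestamp)
as copy from '/tmp/timestamps.txt'
   enclosed by '"';
```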
insert into test_ts_AUTO select t from original;
insert into test_ts_BLOCK_DICT select t from original;
insert into test_ts_BLOCKDICT_COMP select t from original;
insert into test_ts_BZIP_COMP select t from original;
insert into test_ts_COMMONDELTA_COMP select t from original;
insert into test_ts_DELTARANGE_COMP select t from original;
insert into test_ts_DELTAVAL select t from original;
insert into test_ts_GCDDELTA select t from original;
insert into test_ts_GZIP_COMP select t from original;
insert into test_ts_RLE select t from original;
insert into test_ts3_AUTO select t from original;
insert into test_ts3_BLOCK_DICT select t from original;
insert into test_ts3_BLOCKDICT_COMP select t from original;
insert into test_ts3_BZIP_COMP select t from original;
insert into test_ts3_COMMONDELTA_COMP select t from original;
insert into test_ts3_DELTARANGE_COMP select t from original;
insert into test_ts3_DELTAVAL select t from original;
insert into test_ts3_GCDDELTA select t from original;
insert into test_ts3_GZIP_COMP select t from original;
insert into test_ts3_RLE select t from original;
insert into test_ts0_AUTO select t from original;
insert into test_ts0_BLOCK_DICT select t from original;
insert into test_ts0_BLOCKDICT_COMP select t from original;
insert into test_ts0_BZIP_COMP select t from original;
insert into test_ts0_COMMONDELTA_COMP select t from original;
insert into test_ts0_DELTARANGE_COMP select t from original;
insert into test_ts0_DELTAVAL select t from original;
insert into test_ts0_GCDDELTA select t from original;
insert into test_ts0_GZIP_COMP select t from original;
insert into test_ts0_RLE select t from original;
Results:
| Encoding | timestamp(6) | timestamp(3) | timestamp(0) |
|---|---|---|---|
| AUTO | 82.9 MB | 77.786 MB | 41.666 MB |
| BLOCKDICT_COMP | 437.993 MB | 390.953 MB | 22.167 MB |
| BLOCK_DICT | 329.958 MB | 329.958 MB | 26.88 MB |
| BZIP_COMP | 152.099 MB | 124.857 MB | 1.768 MB |
| COMMONDELTA_COMP | 114.521 MB | 13.427 MB | 0.479 MB |
| DELTARANGE_COMP | 69.383 MB | 62.281 MB | 1.491 MB |
| DELTAVAL | 124.057 MB | 124.062 MB | 125.982 MB |
| GCDDELTA | 124.101 MB | 75.033 MB | 26.264 MB |
| GZIP_COMP | 111.654 MB | 96.027 MB | 1.504 MB |
| RLE | 207.763 MB | 191.337 MB | 0.838 MB |
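The article does not show how these sizes were collected; in Vertica they can be read from the storage system tables, for example (column names in `v_monitor.projection_storage` may vary slightly between versions):

```sql
-- Approximate per-table disk usage in MB, summed over projections;
-- assumes the test tables follow the test_ts% naming pattern used above.
select anchor_table_name,
       round(sum(used_bytes) / (1024 * 1024), 3) as used_mb
from v_monitor.projection_storage
where anchor_table_name like 'test_ts%'
group by anchor_table_name
order by anchor_table_name;
```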
Timestamp precision obviously matters.
For precision 0, COMMONDELTA_COMP or RLE works best.
For precision 3, COMMONDELTA_COMP.
For precision 6, DELTARANGE_COMP or AUTO.
IMPORTANT
Always keep the column's cardinality in mind. In this example the load was about 500 rows per second; an encoding algorithm can be more or less effective depending on your system's load.
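A quick way to gauge cardinality at each precision before choosing an encoding is a distinct count (the casts assume Vertica's `::timestamp(n)` truncation; fewer distinct values generally favour RLE and dictionary encodings):

```sql
-- Cardinality of the loaded data at microsecond, millisecond,
-- and second precision.
select count(distinct t)               as distinct_us,
       count(distinct t::timestamp(3)) as distinct_ms,
       count(distinct t::timestamp(0)) as distinct_s
from test_ts_AUTO;
```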