HP Vertica. Storing timestamp data depending on precision and encoding type.


One of the key aspects of working with HP Vertica is optimizing data storage with column encoding. Encoding reduces disk space usage and can make reading large amounts of data faster thanks to on-the-fly decoding.
The optimal encoding type can be chosen based on the data type, its cardinality, and the intensity of the encoding algorithm.

In this article I will investigate disk storage consumption for the timestamp type depending on its precision and encoding type.

A text file was generated for the tests. It contains 43,195,502 timestamps with 6-digit fractional precision, covering a single day: each millisecond of the day produces a record with 50% probability, which averages out to about 500 lines per second.

"18.11.15 00:00:00,001261"
"18.11.15 00:00:00,002797"
"18.11.15 00:00:00,004778"
"18.11.15 00:00:00,005283"
...
"18.11.15 23:59:59,995350"
"18.11.15 23:59:59,999563"
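Such a file could be produced with a short script along these lines (a sketch, not the original generator; the file path and function name are mine):

```python
import random
from datetime import datetime, timedelta

def generate(path, start=datetime(2015, 11, 18), n_ms=86_400_000, p=0.5):
    """For each millisecond in the window, emit a quoted timestamp with
    probability p, adding random microseconds within that millisecond
    (6-digit fractional precision; ~500 rows per second at p=0.5)."""
    with open(path, "w") as f:
        for ms in range(n_ms):
            if random.random() < p:
                t = start + timedelta(milliseconds=ms,
                                      microseconds=random.randrange(1000))
                f.write('"%s"\n' % t.strftime("%d.%m.%y %H:%M:%S,%f"))

generate("/tmp/ts_test.txt", n_ms=2000)  # 2 seconds' worth for a quick check
```

Running it over the full 86,400,000 milliseconds of a day yields roughly the 43 million rows used here.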

Tables with different precisions and encoding types were created as well.

create table test_ts_AUTO(t timestamp encoding AUTO) order by t;
create table test_ts_BLOCK_DICT(t timestamp encoding BLOCK_DICT) order by t;
create table test_ts_BLOCKDICT_COMP(t timestamp encoding BLOCKDICT_COMP) order by t;
create table test_ts_BZIP_COMP(t timestamp encoding BZIP_COMP) order by t;
create table test_ts_COMMONDELTA_COMP(t timestamp encoding COMMONDELTA_COMP) order by t;
create table test_ts_DELTARANGE_COMP(t timestamp encoding DELTARANGE_COMP) order by t;
create table test_ts_DELTAVAL(t timestamp encoding DELTAVAL) order by t;
create table test_ts_GCDDELTA(t timestamp encoding GCDDELTA) order by t;
create table test_ts_GZIP_COMP(t timestamp encoding GZIP_COMP) order by t;
create table test_ts_RLE(t timestamp encoding RLE) order by t;

create table test_ts3_AUTO(t timestamp(3) encoding AUTO) order by t;
create table test_ts3_BLOCK_DICT(t timestamp(3) encoding BLOCK_DICT) order by t;
create table test_ts3_BLOCKDICT_COMP(t timestamp(3) encoding BLOCKDICT_COMP) order by t;
create table test_ts3_BZIP_COMP(t timestamp(3) encoding BZIP_COMP) order by t;
create table test_ts3_COMMONDELTA_COMP(t timestamp(3) encoding COMMONDELTA_COMP) order by t;
create table test_ts3_DELTARANGE_COMP(t timestamp(3) encoding DELTARANGE_COMP) order by t;
create table test_ts3_DELTAVAL(t timestamp(3) encoding DELTAVAL) order by t;
create table test_ts3_GCDDELTA(t timestamp(3) encoding GCDDELTA) order by t;
create table test_ts3_GZIP_COMP(t timestamp(3) encoding GZIP_COMP) order by t;
create table test_ts3_RLE(t timestamp(3) encoding RLE) order by t;

create table test_ts0_AUTO(t timestamp(0) encoding AUTO) order by t;
create table test_ts0_BLOCK_DICT(t timestamp(0) encoding BLOCK_DICT) order by t;
create table test_ts0_BLOCKDICT_COMP(t timestamp(0) encoding BLOCKDICT_COMP) order by t;
create table test_ts0_BZIP_COMP(t timestamp(0) encoding BZIP_COMP) order by t;
create table test_ts0_COMMONDELTA_COMP(t timestamp(0) encoding COMMONDELTA_COMP) order by t;
create table test_ts0_DELTARANGE_COMP(t timestamp(0) encoding DELTARANGE_COMP) order by t;
create table test_ts0_DELTAVAL(t timestamp(0) encoding DELTAVAL) order by t;
create table test_ts0_GCDDELTA(t timestamp(0) encoding GCDDELTA) order by t;
create table test_ts0_GZIP_COMP(t timestamp(0) encoding GZIP_COMP) order by t;
create table test_ts0_RLE(t timestamp(0) encoding RLE) order by t;

The data was then bulk-loaded from a file-based external table (original) into each test table.

insert into test_ts_AUTO select t from original;
insert into test_ts_BLOCK_DICT select t from original;
insert into test_ts_BLOCKDICT_COMP select t from original;
insert into test_ts_BZIP_COMP select t from original;
insert into test_ts_COMMONDELTA_COMP select t from original;
insert into test_ts_DELTARANGE_COMP select t from original;
insert into test_ts_DELTAVAL select t from original;
insert into test_ts_GCDDELTA select t from original;
insert into test_ts_GZIP_COMP select t from original;
insert into test_ts_RLE select t from original;

insert into test_ts3_AUTO select t from original;
insert into test_ts3_BLOCK_DICT select t from original;
insert into test_ts3_BLOCKDICT_COMP select t from original;
insert into test_ts3_BZIP_COMP select t from original;
insert into test_ts3_COMMONDELTA_COMP select t from original;
insert into test_ts3_DELTARANGE_COMP select t from original;
insert into test_ts3_DELTAVAL select t from original;
insert into test_ts3_GCDDELTA select t from original;
insert into test_ts3_GZIP_COMP select t from original;
insert into test_ts3_RLE select t from original;

insert into test_ts0_AUTO select t from original;
insert into test_ts0_BLOCK_DICT select t from original;
insert into test_ts0_BLOCKDICT_COMP select t from original;
insert into test_ts0_BZIP_COMP select t from original;
insert into test_ts0_COMMONDELTA_COMP select t from original;
insert into test_ts0_DELTARANGE_COMP select t from original;
insert into test_ts0_DELTAVAL select t from original;
insert into test_ts0_GCDDELTA select t from original;
insert into test_ts0_GZIP_COMP select t from original;
insert into test_ts0_RLE select t from original;
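The 30 table definitions and loads above follow one pattern per precision and encoding, so the whole batch of statements can be generated with a few lines of script rather than typed by hand; for example:

```python
# Generate the create/insert statements for every precision x encoding pair.
ENCODINGS = ["AUTO", "BLOCK_DICT", "BLOCKDICT_COMP", "BZIP_COMP",
             "COMMONDELTA_COMP", "DELTARANGE_COMP", "DELTAVAL",
             "GCDDELTA", "GZIP_COMP", "RLE"]
PRECISIONS = [("", "timestamp"), ("3", "timestamp(3)"), ("0", "timestamp(0)")]

statements = []
for suffix, ts_type in PRECISIONS:
    for enc in ENCODINGS:
        table = f"test_ts{suffix}_{enc}"
        statements.append(
            f"create table {table}(t {ts_type} encoding {enc}) order by t;")
        statements.append(f"insert into {table} select t from original;")

print("\n".join(statements))
```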

Results:

Encoding            timestamp(6)   timestamp(3)   timestamp(0)
AUTO                   82.900 MB      77.786 MB      41.666 MB
BLOCKDICT_COMP        437.993 MB     390.953 MB      22.167 MB
BLOCK_DICT            329.958 MB     329.958 MB      26.880 MB
BZIP_COMP             152.099 MB     124.857 MB       1.768 MB
COMMONDELTA_COMP      114.521 MB      13.427 MB       0.479 MB
DELTARANGE_COMP        69.383 MB      62.281 MB       1.491 MB
DELTAVAL              124.057 MB     124.062 MB     125.982 MB
GCDDELTA              124.101 MB      75.033 MB      26.264 MB
GZIP_COMP             111.654 MB      96.027 MB       1.504 MB
RLE                   207.763 MB     191.337 MB       0.838 MB

Timestamp precision obviously matters.
For precision 0, COMMONDELTA_COMP or RLE give the best results.
For precision 3 – COMMONDELTA_COMP.
For precision 6 – DELTARANGE_COMP or AUTO.
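To put the numbers in perspective: Vertica stores a timestamp as an 8-byte integer, so an unencoded column of 43,195,502 values would take roughly 330 MB. A quick back-of-the-envelope calculation of the compression ratios for the best timestamp(0) encodings (the sizes are taken from the results table above; the 8-bytes-per-value assumption is mine):

```python
ROWS = 43_195_502
raw_mb = ROWS * 8 / 1024 / 1024  # ~330 MB, assuming 8 bytes per timestamp

# Measured timestamp(0) on-disk sizes from the results table above.
sizes_mb = {
    "COMMONDELTA_COMP": 0.479,
    "RLE": 0.838,
    "DELTARANGE_COMP": 1.491,
    "AUTO": 41.666,
}

for enc, mb in sizes_mb.items():
    print(f"{enc:17s} {raw_mb / mb:7.0f}x smaller than raw")
```

COMMONDELTA_COMP at second precision ends up several hundred times smaller than the raw representation.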

IMPORTANT
Always keep the column's cardinality in mind. In this example the load was about 500 rows per second; an encoding algorithm can be more or less optimal depending on your own system's load.
