HP Vertica: storing timestamp data depending on precision and encoding type

One of the key aspects of working with HP Vertica is optimizing data storage through encoding. Encoding reduces disk space usage and speeds up retrieval of large amounts of data thanks to on-the-fly decoding.
The optimal encoding type can be suggested based on the data type, its cardinality, and the CPU intensity of the encoding algorithm.

In this article I investigate disk storage consumption for the timestamp type depending on its precision and on the encoding type.

A text file was generated for the tests. It contains 43,195,502 lines of timestamps with 6-digit (microsecond) precision covering a single day: each millisecond of the day gets a record with 50% probability, which averages out to about 500 lines per second.

"18.11.15 00:00:00,001261"
"18.11.15 00:00:00,002797"
"18.11.15 00:00:00,004778"
"18.11.15 00:00:00,005283"
...
"18.11.15 23:59:59,995350"
"18.11.15 23:59:59,999563"

A set of tables with different precisions and encoding types was created as well.

create table test_ts_AUTO(t timestamp encoding AUTO) order by t;
create table test_ts_BLOCK_DICT(t timestamp encoding BLOCK_DICT) order by t;
create table test_ts_BLOCKDICT_COMP(t timestamp encoding BLOCKDICT_COMP) order by t;
create table test_ts_BZIP_COMP(t timestamp encoding BZIP_COMP) order by t;
create table test_ts_COMMONDELTA_COMP(t timestamp encoding COMMONDELTA_COMP) order by t;
create table test_ts_DELTARANGE_COMP(t timestamp encoding DELTARANGE_COMP) order by t;
create table test_ts_DELTAVAL(t timestamp encoding DELTAVAL) order by t;
create table test_ts_GCDDELTA(t timestamp encoding GCDDELTA) order by t;
create table test_ts_GZIP_COMP(t timestamp encoding GZIP_COMP) order by t;
create table test_ts_RLE(t timestamp encoding RLE) order by t;

create table test_ts3_AUTO(t timestamp(3) encoding AUTO) order by t;
create table test_ts3_BLOCK_DICT(t timestamp(3) encoding BLOCK_DICT) order by t;
create table test_ts3_BLOCKDICT_COMP(t timestamp(3) encoding BLOCKDICT_COMP) order by t;
create table test_ts3_BZIP_COMP(t timestamp(3) encoding BZIP_COMP) order by t;
create table test_ts3_COMMONDELTA_COMP(t timestamp(3) encoding COMMONDELTA_COMP) order by t;
create table test_ts3_DELTARANGE_COMP(t timestamp(3) encoding DELTARANGE_COMP) order by t;
create table test_ts3_DELTAVAL(t timestamp(3) encoding DELTAVAL) order by t;
create table test_ts3_GCDDELTA(t timestamp(3) encoding GCDDELTA) order by t;
create table test_ts3_GZIP_COMP(t timestamp(3) encoding GZIP_COMP) order by t;
create table test_ts3_RLE(t timestamp(3) encoding RLE) order by t;

create table test_ts0_AUTO(t timestamp(0) encoding AUTO) order by t;
create table test_ts0_BLOCK_DICT(t timestamp(0) encoding BLOCK_DICT) order by t;
create table test_ts0_BLOCKDICT_COMP(t timestamp(0) encoding BLOCKDICT_COMP) order by t;
create table test_ts0_BZIP_COMP(t timestamp(0) encoding BZIP_COMP) order by t;
create table test_ts0_COMMONDELTA_COMP(t timestamp(0) encoding COMMONDELTA_COMP) order by t;
create table test_ts0_DELTARANGE_COMP(t timestamp(0) encoding DELTARANGE_COMP) order by t;
create table test_ts0_DELTAVAL(t timestamp(0) encoding DELTAVAL) order by t;
create table test_ts0_GCDDELTA(t timestamp(0) encoding GCDDELTA) order by t;
create table test_ts0_GZIP_COMP(t timestamp(0) encoding GZIP_COMP) order by t;
create table test_ts0_RLE(t timestamp(0) encoding RLE) order by t;

Data was bulk-inserted from an external file-based table (original).
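
The definition of that external table is not shown here; a plausible sketch, assuming the file sits at /tmp/timestamps.txt and using a COPY format mask matching the sample above (both the path and the mask are assumptions):

-- external table over the generated text file;
-- the mask parses values like "18.11.15 00:00:00,001261"
create external table original(t timestamp) as copy
    (t format 'DD.MM.YY HH24:MI:SS,US')
    from '/tmp/timestamps.txt'
    enclosed by '"';

If the per-column FORMAT option is rejected inside an external table definition, the same mask can be used in a plain COPY into a staging table. The bulk inserts themselves: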

insert into test_ts_AUTO select t from original;
insert into test_ts_BLOCK_DICT select t from original;
insert into test_ts_BLOCKDICT_COMP select t from original;
insert into test_ts_BZIP_COMP select t from original;
insert into test_ts_COMMONDELTA_COMP select t from original;
insert into test_ts_DELTARANGE_COMP select t from original;
insert into test_ts_DELTAVAL select t from original;
insert into test_ts_GCDDELTA select t from original;
insert into test_ts_GZIP_COMP select t from original;
insert into test_ts_RLE select t from original;

insert into test_ts3_AUTO select t from original;
insert into test_ts3_BLOCK_DICT select t from original;
insert into test_ts3_BLOCKDICT_COMP select t from original;
insert into test_ts3_BZIP_COMP select t from original;
insert into test_ts3_COMMONDELTA_COMP select t from original;
insert into test_ts3_DELTARANGE_COMP select t from original;
insert into test_ts3_DELTAVAL select t from original;
insert into test_ts3_GCDDELTA select t from original;
insert into test_ts3_GZIP_COMP select t from original;
insert into test_ts3_RLE select t from original;

insert into test_ts0_AUTO select t from original;
insert into test_ts0_BLOCK_DICT select t from original;
insert into test_ts0_BLOCKDICT_COMP select t from original;
insert into test_ts0_BZIP_COMP select t from original;
insert into test_ts0_COMMONDELTA_COMP select t from original;
insert into test_ts0_DELTARANGE_COMP select t from original;
insert into test_ts0_DELTAVAL select t from original;
insert into test_ts0_GCDDELTA select t from original;
insert into test_ts0_GZIP_COMP select t from original;
insert into test_ts0_RLE select t from original;
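
The per-table sizes can be pulled from Vertica's system tables; a sketch, assuming each test table has a single superprojection and summing usage across nodes:

-- on-disk size per test table, in MB
select anchor_table_name,
       round(sum(used_bytes) / 1024 / 1024, 3) as used_mb
from v_monitor.projection_storage
where anchor_table_name ilike 'test_ts%'
group by anchor_table_name
order by anchor_table_name;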

Results:

Encoding             timestamp(6)    timestamp(3)    timestamp(0)
AUTO                    82.900 MB       77.786 MB       41.666 MB
BLOCKDICT_COMP         437.993 MB      390.953 MB       22.167 MB
BLOCK_DICT             329.958 MB      329.958 MB       26.880 MB
BZIP_COMP              152.099 MB      124.857 MB        1.768 MB
COMMONDELTA_COMP       114.521 MB       13.427 MB        0.479 MB
DELTARANGE_COMP         69.383 MB       62.281 MB        1.491 MB
DELTAVAL               124.057 MB      124.062 MB      125.982 MB
GCDDELTA               124.101 MB       75.033 MB       26.264 MB
GZIP_COMP              111.654 MB       96.027 MB        1.504 MB
RLE                    207.763 MB      191.337 MB        0.838 MB

Timestamp precision obviously matters. For reference, 43,195,502 rows at 8 bytes per raw timestamp is roughly 329.6 MB, which is about where BLOCK_DICT landed at microsecond precision, while BLOCKDICT_COMP even inflated the data.
For precision 0, COMMONDELTA_COMP or RLE work best.
For precision 3: COMMONDELTA_COMP.
For precision 6: DELTARANGE_COMP or AUTO.

IMPORTANT
Always keep the column's cardinality in mind. In this example the load was about 500 rows per second; an encoding algorithm can turn out more or less optimal depending on your own system's load, so validate against your actual data.
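
Note also that AUTO is not a single algorithm: Vertica chooses a default scheme based on the column's data type. Which encoding was actually applied can be checked in the catalog; a sketch, assuming the v_catalog.projection_columns view:

-- encoding actually chosen for the AUTO test tables
select projection_name, projection_column_name, encoding_type
from v_catalog.projection_columns
where table_name like 'test_ts%AUTO';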
