In GPDB 5x, a table's data can be distributed RANDOMLY or by a set of columns. The essential parts of column-based distribution are:
- hashing
- target segment number calculation
I will try to dig into them in this article.
To hash, or not to hash (on the dispatcher), that is the question.
Based on logic and common sense, you may think GPDB always calculates the hash on the dispatcher side and sends an inserted tuple only to one target segment. The truth is that the dispatcher may or may not do so, depending on several conditions.
So, to pre-calculate the hash and the target segment number on the dispatcher side, and then send the tuple only to the target segment, all of the following conditions must be true:
- Postgres planner used.
- Single row inserted.
- gp_enable_fast_sri enabled.
- gp_enable_direct_dispatch enabled.
That's it. When any of these conditions is not met (when you use ORCA, for example), the tuple goes to all segments.
But how does GPDB avoid duplicates? That's simple. GPDB filters them using the TupleMatchesHashFilter() function – tuples whose target segment number is not equal to the current one simply will not be inserted.
From the segment's point of view, inserting into one pre-calculated segment looks the same. The same check using TupleMatchesHashFilter() is performed, just to make sure the tuple belongs to the current segment. The main difference is that the tuple is sent to only one segment.
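To make the mechanics concrete, here is a minimal standalone sketch of that per-segment check. It is a simplified model with illustrative names, not the actual nodeResult.c code: the tuple's distribution-key hash is reduced to a segment number and compared with the segment the check runs on.

#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the duplicate-avoidance check every segment performs
 * when the dispatcher sends the tuple to all segments. Names are illustrative;
 * the real logic lives in TupleMatchesHashFilter() in nodeResult.c. */
static bool tuple_matches_hash_filter(uint32_t distkey_hash,
                                      int num_segments,
                                      int current_segment)
{
    /* reduce the distribution-key hash to a target segment (modulus case) */
    int target_segment = (int)(distkey_hash % (uint32_t)num_segments);

    /* only the owning segment inserts the tuple; the rest silently drop it */
    return target_segment == current_segment;
}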
Typical TupleMatchesHashFilter() backtrace
#0 TupleMatchesHashFilter (resultSlot=, resultNode=) at nodeResult.c:232
#1 ExecResult (node=node@entry=0x55f8676767f0) at nodeResult.c:233
#2 0x000055f8653f2f56 in ExecProcNode (node=node@entry=0x55f8676767f0) at execProcnode.c:929
#3 0x000055f8653ef962 in ExecutePlan (estate=0x55f867676128, planstate=0x55f8676767f0, operation=CMD_INSERT, numberTuples=0, direction=, dest=0x55f8676800d8) at execMain.c:2900
#4 0x000055f8653f00e6 in ExecutorRun (queryDesc=queryDesc@entry=0x55f86798d058, direction=direction@entry=ForwardScanDirection, count=count@entry=0) at execMain.c:912
#5 0x000055f8655b9079 in ProcessQuery (portal=portal@entry=0x55f86767b158, stmt=stmt@entry=0x55f86766e1e8, params=, dest=dest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:296
#6 0x000055f8655b96bb in PortalRunMulti (portal=portal@entry=0x55f86767b158, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x55f8676800d8, altdest=altdest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:1467
#7 0x000055f8655baa2e in PortalRun (portal=portal@entry=0x55f86767b158, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x55f8676800d8, altdest=altdest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:1029
#8 0x000055f8655b272e in exec_mpp_query ( query_string=query_string@entry=0x55f86766c466 "insert into t_test values('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d');", serializedQuerytree=serializedQuerytree@entry=0x0, serializedQuerytreelen=serializedQuerytreelen@entry=0, serializedPlantree=serializedPlantree@entry=0x55f86766c4cc "7\002", serializedPlantreelen=serializedPlantreelen@entry=236, serializedParams=serializedParams@entry=0x0, serializedParamslen=0, serializedQueryDispatchDesc=0x55f86766c5b8 "s", serializedQueryDispatchDesclen=75, seqServerHost=0x55f86766c603 "127.0.1.1", seqServerPort=38837, localSlice=0) at postgres.c:1349
#9 0x000055f8655b7ee8 in PostgresMain (argc=, argv=argv@entry=0x55f867672e30, dbname=, username=) at postgres.c:5159
#10 0x000055f86554a19b in BackendRun (port=0x55f8676834b0) at postmaster.c:6732
#11 BackendStartup (port=0x55f8676834b0) at postmaster.c:6406
#12 ServerLoop () at postmaster.c:2444
#13 0x000055f86554bb28 in PostmasterMain (argc=, argv=) at postmaster.c:1528
#14 0x000055f8651d3340 in main (argc=12, argv=0x55f8676499b0) at main.c:206
From the dispatcher's point of view, direct inserting into one target segment obviously assumes hash calculation. directDispatchCalculateHash() is the hashing workhorse for direct dispatching to tables with a non-random hashing policy.
Typical directDispatchCalculateHash() backtrace
#0 directDispatchCalculateHash (targetPolicy=0x556f259e1830, plan=0x556f259e0de8) at cdbmutate.c:730
#1 apply_motion (root=root@entry=0x556f259e0738, plan=plan@entry=0x556f259e0de8, query=query@entry=0x556f259e00d0) at cdbmutate.c:730
#2 0x0000556f2416f8af in cdbparallelize (root=0x556f259e0738, plan=plan@entry=0x556f259e0de8, query=query@entry=0x556f259e00d0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at cdbllize.c:272
#3 0x0000556f23e93838 in standard_planner (parse=0x556f259e00d0, cursorOptions=0, boundParams=0x0) at planner.c:348
#4 0x0000556f23e93ac5 in planner (parse=parse@entry=0x556f2557d5c0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at planner.c:163
#5 0x0000556f23f4a013 in pg_plan_query (cursorOptions=0, boundParams=0x0, querytree=0x556f2557d5c0) at postgres.c:914
#6 pg_plan_queries (querytrees=0x556f25649888, boundParams=boundParams@entry=0x0, needSnapshot=0 '\000', cursorOptions=0) at postgres.c:991
#7 0x0000556f23f4c234 in exec_simple_query ( query_string=query_string@entry=0x556f2557c478 "insert into t_test values('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d');", seqServerHost=seqServerHost@entry=0x0, seqServerPort=seqServerPort@entry=-1) at postgres.c:1704
#8 0x0000556f23f4e55a in PostgresMain (argc=, argv=argv@entry=0x556f252d9c88, dbname=, username=) at postgres.c:4975
#9 0x0000556f23ee119b in BackendRun (port=0x556f252e1d10) at postmaster.c:6732
#10 BackendStartup (port=0x556f252e1d10) at postmaster.c:6406
#11 ServerLoop () at postmaster.c:2444
#12 0x0000556f23ee2b28 in PostmasterMain (argc=, argv=) at postmaster.c:1528
#13 0x0000556f23b6a340 in main (argc=15, argv=0x556f252b05f0) at main.c:206
Hash, hash, hash again, and umm… hash.
The two main goals of the directDispatchCalculateHash() function are:
- Calculate the hash over the whole distribution key using multiple calls to cdbhash().
- Decide (based on the total hash value) which segment the tuple should go to, using cdbhashreduce().
Let’s take a closer look at how it works.
At the very beginning, the hash structure is initialized with the following values:
- Number of segments.
- Reduction policy. If the number of segments is a power of 2, the faster bitmask reduction (REDUCE_BITMASK) is used; if not, the slower modulus (REDUCE_LAZYMOD).
- Random key for the random distribution policy (not our case).
Additionally, a magic hashing value is applied as the base value for the future hash calculation. This value is fixed and equal to (dec/hex/bin):
2166136261; 0x811c9dc5; 10000001000111001001110111000101
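A minimal sketch of this initialization, with illustrative field and constant names (the real structure and constants live in cdbhash.c and may differ; the random-policy key is omitted here):

#include <stdbool.h>
#include <stdint.h>

#define FNV1_32_INIT 2166136261u          /* the magic base value, 0x811c9dc5 */

typedef enum { REDUCE_BITMASK, REDUCE_LAZYMOD } reduce_alg;

typedef struct
{
    uint32_t   hash;        /* running hash value                  */
    int        numsegs;     /* number of segments in the cluster   */
    reduce_alg reducealg;   /* how the hash is mapped to a segment */
} cdb_hash_state;

static bool is_power_of_two(int n) { return n > 0 && (n & (n - 1)) == 0; }

static void cdb_hash_init(cdb_hash_state *h, int numsegs)
{
    h->numsegs   = numsegs;
    /* power-of-two segment counts allow the faster bitmask reduction */
    h->reducealg = is_power_of_two(numsegs) ? REDUCE_BITMASK : REDUCE_LAZYMOD;
    h->hash      = FNV1_32_INIT;    /* base value for the hashing loop below */
}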
Right after that, the distribution key hashing loop begins. Usually it calls the cdbhash() function, which calculates a new hash value based on the previously calculated one and on the current Datum value. The hashDatum() function used by cdbhash() just prepares the input Datum for further processing, normalizing it (to int64 for integers; ignoring blank chars for text and others).
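As a toy illustration of that normalization (assuming, for example, that trailing blanks in a bpchar-style value do not affect the hash; the real hashDatum() handles many more types and the details differ):

#include <stddef.h>

/* Trim trailing blanks before the bytes are fed to the hash function, so two
 * values differing only in trailing spaces hash identically. Sketch only. */
static size_t hashable_text_length(const char *data, size_t len)
{
    while (len > 0 && data[len - 1] == ' ')
        len--;
    return len;
}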
After normalizing is done, it's time to calculate the current Datum's hash. The addToCdbHash() function is used for that purpose. addToCdbHash(), in turn, calls the fnv1_32_buf() function, which implements the FNV-1 algorithm. Let's take a look at an example.
Imagine inserting the following tuple, whose columns are all part of the distribution key and which consists of two values:
('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d')
And this is how the first value looks in memory (dec/hex):
0x55f61a41bd5c: 49 102 54 54 52 101 100 51
0x55f61a41bd64: 101 101 53 52 97 57 99 55
0x55f61a41bd6c: 51 53 97 97 98 100 101 98
0x55f61a41bd74: 99 52 54 101 101 48 57 54
0x55f61a41bd5c: 0x31 0x66 0x36 0x36 0x34 0x65 0x64 0x33
0x55f61a41bd64: 0x65 0x65 0x35 0x34 0x61 0x39 0x63 0x37
0x55f61a41bd6c: 0x33 0x35 0x61 0x61 0x62 0x64 0x65 0x62
0x55f61a41bd74: 0x63 0x34 0x36 0x65 0x65 0x30 0x39 0x36
Looping over all the bytes of the first value, we first transform the hash value based on itself only, using the following formula:
hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);
That is, the hash value (or the base value at the first iteration) plus its values left-shifted by 1, 4, 7, 8 and 24 bits. (This is equivalent to multiplying by the FNV prime 16777619, with the shifts standing in for the multiplication.)
(hval << 1): 37305226; 0x2393b8a; 10001110010011101110001010
(hval << 4): 298441808; 0x11c9dc50; 10001110010011101110001010000
(hval << 7): 2387534464; 0x8e4ee280; 10001110010011101110001010000000
(hval << 8): 480101632; 0x1c9dc500; 11100100111011100010100000000
(hval << 24): 3305111552; 0xc5000000; 11000101000000000000000000000000
Which is 8674630943 in total, more than twice the max uint32 value (4294967295), so it wraps around modulo 2^32 twice:
8674630943 - (4294967295 + 1) - (4294967295 + 1) = 84696351  # new hval
84696351; 0x50c5d1f; 101000011000101110100011111
The next line of the hashing algorithm is where the current Datum is taken into account. A bitwise XOR with the first Datum's byte (49 as uint32; '1' as char) transforms hval into a new value:
84696366; 0x50c5d2e; 101000011000101110100101110
After 32 loops (the length of the first Datum) our hash looks like this:
3465832814; 0xce94696e; 11001110100101000110100101101110
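Here is a minimal sketch of this per-byte FNV-1 loop (the real code is fnv1_32_buf(), called via addToCdbHash(); the name below is illustrative). The five shift-adds are the multiplication by the FNV prime 16777619 spelled out as shifts.

#include <stddef.h>
#include <stdint.h>

/* Per-byte FNV-1 step as walked through above: multiply the running hash by
 * the FNV prime (via shifts, wrapping modulo 2^32), then XOR in the byte. */
static uint32_t fnv1_32_step(uint32_t hval, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);
        hval ^= (uint32_t) buf[i];
    }
    return hval;
}

Feeding the first value above into this loop with the base 2166136261 should reproduce the intermediate values from the walkthrough; the result then becomes the base for the second value, exactly as described next.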
At the next iteration, we start to use this value as the base for the next Datum's hash calculation.
Here is how the tuple's second value looks (dec/hex):
0x55f61a41bf2c: 100 49 55 102 100 101 99 49
0x55f61a41bf34: 45 97 99 57 52 45 52 53
0x55f61a41bf3c: 55 50 45 97 56 52 52 45
0x55f61a41bf44: 57 100 53 52 102 50 49 97
0x55f61a41bf4c: 48 56 49 100
0x55f61a41bf2c: 0x64 0x31 0x37 0x66 0x64 0x65 0x63 0x31
0x55f61a41bf34: 0x2d 0x61 0x63 0x39 0x34 0x2d 0x34 0x35
0x55f61a41bf3c: 0x37 0x32 0x2d 0x61 0x38 0x34 0x34 0x2d
0x55f61a41bf44: 0x39 0x64 0x35 0x34 0x66 0x32 0x31 0x61
0x55f61a41bf4c: 0x30 0x38 0x31 0x64
At the 36th iteration (the length of the second value) we reach the last calculation. The left-shifted values of hval, which by this point equals 1674065706 (not the 3465832814 we started the second value with), are:
(hval << 1): 3348131412; 0xc7906e54; 11000111100100000110111001010100
(hval << 4): 1015247520; 0x3c8372a0; 111100100000110111001010100000
(hval << 7): 3827012864; 0xe41b9500; 11100100000110111001010100000000
(hval << 8): 3359058432; 0xc8372a00; 11001000001101110010101000000000
(hval << 24): 704643072; 0x2a000000; 101010000000000000000000000000
The new hval is calculated using the left-shifted values:
13928159006 - (4294967295 + 1) - (4294967295 + 1) - (4294967295 + 1) = 1043257118
1043257118; 0x3e2ed71e; 111110001011101101011100011110
And finally, XOR with the last char of the second value (100 as uint32; 'd' as char):
1043257210; 0x3e2ed77a; 111110001011101101011101111010
This is our total hash value.
The last step is to calculate the target segment number based on it. If the total number of segments in the cluster is a power of 2, we do a bitwise AND of the hash and the number of segments minus one. Otherwise we do a simple modulus, which in the three-segment cluster case gives the following result:
(h->hash) % (h->numsegs) = (1043257210) % (3) = 1
That's it. The tuple will go to the second segment (counting from 0).
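A minimal sketch of that reduction step (the real function is cdbhashreduce(); simplified here, with the bitmask and modulus branches as described above):

#include <stdint.h>

/* Map the final 32-bit hash to a segment number. */
static int cdb_hash_reduce(uint32_t hash, int numsegs)
{
    if ((numsegs & (numsegs - 1)) == 0)          /* power of two: fast bitmask */
        return (int)(hash & (uint32_t)(numsegs - 1));
    return (int)(hash % (uint32_t)numsegs);      /* otherwise: lazy modulus    */
}

cdb_hash_reduce(1043257210, 3) returns 1, matching the example above.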
The Good, the Bad and the Modulus.
Despite the fact that the FNV-1 function is quite strong and works well, the "Modulus" can put a spoke in someone's wheel.
Imagine we have the following set of tuples.
('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d')
('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d')
('37f824e98ffe07d2b4e232fbb6006c8f','d17fdec1-ac94-4572-a844-9d54f21a081d')
('8bf56d72a0af20a4f00896f102a589da','d17fdec1-ac94-4572-a844-9d54f21a081d')
('a7088ec9c477bda70be15bfae1daf3c2','d17fdec1-ac94-4572-a844-9d54f21a081d')
That's right, they all have the same second value, but this should not be a problem for a "Good" hash function.
Here are the hash values calculated for them.
1043257210
2813991850
151371370
1588923970
2762693290
They are different. Yes, they may look similar because they all divide by 10 without remainder, but they are different.
Let's calculate the target segment numbers. As mentioned above, there are two ways of calculating the segment number from the hash, depending on whether the total segment count is a power of 2 or not.
On the left – the hash value. On top – the total count of segments. In the middle – the target segment number. (The numbers are equal if everything is calculated with modulus, but I'll keep the tables separate.)
            3  5  6  7  9 10 11 12 13 14 15
1043257210  1  0  4  2  7  0  6 10  8  2 10
2813991850  1  0  4  5  1  0 10 10  7 12 10
151371370   1  0  4  3  1  0  7 10  7 10 10
1588923970  1  0  4  4  7  0  7 10 10  4 10
2762693290  1  0  4  0  1  0  5 10  6  0 10
            2  4  8 16
1043257210  0  2  2 10
2813991850  0  2  2 10
151371370   0  2  2 10
1588923970  0  2  2  2
2762693290  0  2  2 10
Whoopsie. For most of the small segment counts (on top), the target segment number (in the middle) is the same no matter which hash value (on the left) was used. Such "Bad", or call them "unlucky", tuples may ruin the whole distribution.
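You can reproduce these tables with a few lines of C; the snippet below simply reduces the five hash values by modulus for every segment count from 2 to 16 (for power-of-two counts the bitmask reduction gives the same result, as noted above).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t hashes[] = { 1043257210u, 2813991850u, 151371370u,
                                1588923970u, 2762693290u };

    for (int nsegs = 2; nsegs <= 16; nsegs++)
    {
        printf("%2d segments:", nsegs);
        for (int i = 0; i < 5; i++)
            printf(" %2u", (unsigned)(hashes[i] % (uint32_t) nsegs)); /* target segment */
        printf("\n");
    }
    return 0;
}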
To tell the truth, your application is unlikely to generate only such "unlucky" values. Moreover, changing just the last char of a value (1f664ed3ee54a9c735aabdebc46ee096 to 1f664ed3ee54a9c735aabdebc46ee097) or removing any char fixes the problem.
BUT. If your application really generates only such values (I don't know how; please don't ask), I can advise the following:
- Use another distribution key.
- Use wider distribution key with more columns.
- Use higher number of segments.
- Use an odd or, better yet, a prime (7+) number of segments.
Back to the future. A few words about 6x.
6x with set gp_use_legacy_hashops=on; inherits 5x's behavior, with some differences in the stack trace for directDispatchCalculateHash().
#0 directDispatchCalculateHash (hashfuncs=0x55def7ac7028, targetPolicy=0x55def7ac6980, plan=0x55def7bf1df8) at cdbmutate.c:149
#1 sri_optimize_for_result (root=root@entry=0x55def7bf1968, plan=plan@entry=0x55def7bf1df8, rte=rte@entry=0x55def7bf17b8, targetPolicy=targetPolicy@entry=0x7ffe079ed850, hashExprs_p=hashExprs_p@entry=0x7ffe079ed840, hashOpfamilies_p=hashOpfamilies_p@entry=0x7ffe079ed848) at cdbmutate.c:3576
#2 0x000055def046501d in adjust_modifytable_flow (is_split_updates=, node=, root=0x55def7bf1968) at createplan.c:6629
#3 make_modifytable (root=root@entry=0x55def7bf1968, operation=, canSetTag=, resultRelations=, subplans=subplans@entry=0x55def7ac68e8, withCheckOptionLists=withCheckOptionLists@entry=0x0, returningLists=, is_split_updates=, rowMarks=, epqParam=) at createplan.c:6492
#4 0x000055def0473771 in subquery_planner (glob=glob@entry=0x55def79cded8, parse=parse@entry=0x55def7bf1480, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=tuple_fraction@entry=0, subroot=subroot@entry=0x7ffe079ed948, config=) at planner.c:907
#5 0x000055def0473ad8 in standard_planner (parse=0x55def7bf1480, cursorOptions=0, boundParams=0x0) at planner.c:345
#6 0x000055def0474425 in planner (parse=parse@entry=0x55def79cdff0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at planner.c:200
#7 0x000055def054a0f6 in pg_plan_query (boundParams=, cursorOptions=, querytree=) at postgres.c:971
#8 pg_plan_queries (querytrees=, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at postgres.c:1030
#9 0x000055def054aec4 in exec_simple_query ( query_string=0x55def79ccfd8 "insert into t_test values('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d');") at postgres.c:1766
#10 0x000055def054d5b9 in PostgresMain (argc=, argv=argv@entry=0x55def79ac720, dbname=, username=) at postgres.c:5290
#11 0x000055def04c048d in BackendRun (port=0x55def79d52e0) at postmaster.c:4811
#12 BackendStartup (port=0x55def79d52e0) at postmaster.c:4468
#13 ServerLoop () at postmaster.c:1948
#14 0x000055def04c14be in PostmasterMain (argc=6, argv=) at postmaster.c:1518
#15 0x000055def0114725 in main (argc=6, argv=0x55def79aa620) at main.c:245
And for TupleMatchesHashFilter():
#0 cdbhashreduce (h=0x557fbddd5380) at cdbhash.c:242
#1 0x0000557fb5a2f8fc in TupleMatchesHashFilter (node=0x557fbde94748, node=0x557fbde94748, resultSlot=0x557fbde93568) at nodeResult.c:279
#2 ExecResult (node=node@entry=0x557fbde94748) at nodeResult.c:229
#3 0x0000557fb59fefb8 in ExecProcNode (node=node@entry=0x557fbde94748) at execProcnode.c:970
#4 0x0000557fb5a2adfd in ExecModifyTable (node=node@entry=0x557fbde942f0) at nodeModifyTable.c:1745
#5 0x0000557fb59fefd8 in ExecProcNode (node=node@entry=0x557fbde942f0) at execProcnode.c:974
#6 0x0000557fb59f5201 in ExecutePlan (estate=estate@entry=0x557fbde92fd8, planstate=0x557fbde942f0, operation=operation@entry=CMD_INSERT, sendTuples=sendTuples@entry=0 '\000', numberTuples=numberTuples@entry=0, direction=direction@entry=ForwardScanDirection, dest=0x557fbddb78d0) at execMain.c:2956
#7 0x0000557fb59f5b5e in ExecutePlan (dest=0x557fbddb78d0, direction=ForwardScanDirection, numberTuples=0, sendTuples=0 '\000', operation=CMD_INSERT, planstate=, estate=0x557fbde92fd8) at execMain.c:2922
#8 standard_ExecutorRun (queryDesc=0x557fbde85848, direction=ForwardScanDirection, count=0) at execMain.c:983
#9 0x0000557fb5beda59 in ProcessQuery (portal=portal@entry=0x557fbde96ff8, stmt=stmt@entry=0x557fbddb6a58, params=, dest=dest@entry=0x557fbddb78d0, completionTag=completionTag@entry=0x7fff415acef0 "", sourceText=) at pquery.c:288
#10 0x0000557fb5bedfc5 in PortalRunMulti (portal=portal@entry=0x557fbde96ff8, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x557fbddb78d0, altdest=altdest@entry=0x557fbddb78d0, completionTag=completionTag@entry=0x7fff415acef0 "") at pquery.c:1473
#11 0x0000557fb5bef522 in PortalRun (portal=portal@entry=0x557fbde96ff8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x557fbddb78d0, altdest=altdest@entry=0x557fbddb78d0, completionTag=completionTag@entry=0x7fff415acef0 "") at pquery.c:1018
#12 0x0000557fb5be7ef3 in exec_mpp_query ( query_string=0x557fbddb61d0 "insert into t_test values('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d');", serializedQuerytree=, serializedQuerytreelen=, serializedPlantree=, serializedPlantreelen=, serializedParams=, serializedParamslen=28, serializedQueryDispatchDesc=0x557fbddb638f "(\265/\375 {\245\002", serializedQueryDispatchDesclen=93) at postgres.c:1383
#13 0x0000557fb5bec646 in PostgresMain (argc=, argv=argv@entry=0x557fbdd95a70, dbname=, username=) at postgres.c:5436
#14 0x0000557fb5b5f48d in BackendRun (port=0x557fbddbe5e0) at postmaster.c:4811
#15 BackendStartup (port=0x557fbddbe5e0) at postmaster.c:4468
#16 ServerLoop () at postmaster.c:1948
#17 0x0000557fb5b604be in PostmasterMain (argc=5, argv=) at postmaster.c:1518
#18 0x0000557fb57b3725 in main (argc=5, argv=0x557fbdd939e0) at main.c:245
ORCA still can't dispatch an insert to only the target segment, and all four conditions for such a dispatch are still the same.
As for jump hashing, the main differences are here, here (with differences in the Datum hashing algorithms), and here, but generally the backtrace is the same.
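For reference, the non-legacy 6x path reduces hash values to segments with a jump-consistent-hash style function. Below is a minimal sketch of the well-known Lamping–Veach jump consistent hash algorithm; the actual function in GPDB's cdbhash.c may differ in naming and details.

#include <stdint.h>

/* Jump consistent hash (Lamping & Veach, 2014): maps a 64-bit key to one of
 * num_segments buckets with minimal movement when the bucket count changes. */
static int32_t jump_consistent_hash(uint64_t key, int32_t num_segments)
{
    int64_t b = -1, j = 0;
    while (j < num_segments)
    {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t)((double)(b + 1) *
                      ((double)(1LL << 31) / (double)((key >> 33) + 1)));
    }
    return (int32_t) b;
}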