About distribution, hashing and target segment number calculation in GPDB 5x.

In GPDB 5x, a table's data can be distributed RANDOMLY or by a set of columns. The essential parts of column-based distribution are:

  • hashing
  • target segment number calculation

I will try to dig into them in this article.

To hash, or not to hash (on dispatcher), that is the question.

Based on logic and common sense, you may think GPDB always calculates the hash on the dispatcher side and sends each inserted tuple to only one, target, segment. The truth is that the dispatcher may or may not do this, depending on several conditions.
So, to pre-calculate the hash and target segment number on the dispatcher side, and then send the tuple only to the target segment, the following conditions must all be true.

  • The Postgres planner is used.
  • A single row is inserted.
  • gp_enable_fast_sri is enabled.
  • gp_enable_direct_dispatch is enabled.

That’s it. It means that when any of these conditions is not true, for example when you are using ORCA, the tuple will go to all segments.

But how does GPDB avoid duplicates? That’s simple. GPDB filters them using the TupleMatchesHashFilter() function: tuples whose target segment number is not equal to the current segment's number simply will not be inserted.

Inserting into one pre-calculated segment looks the same from the segment's point of view. The same check using the TupleMatchesHashFilter() function is still performed, just to be sure the tuple belongs to the current segment. The main difference is that the tuple is sent to only one segment.
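
The idea behind this filter can be sketched in a few lines of C. This is a simplified illustration of the logic described above, not the actual nodeResult.c code, and the names are made up for the example:

#include <stdint.h>
#include <stdbool.h>

/* Simplified illustration of the duplicate filter; not the real nodeResult.c code. */
static bool
tuple_matches_hash_filter(uint32_t tuple_hash, int num_segments, int my_segment)
{
    /* Reduce the tuple's hash to a segment number the same way the planner does
     * (modulus here for brevity; the real code may use a bitmask, see below). */
    int target_segment = (int) (tuple_hash % (uint32_t) num_segments);

    /* Only the segment that owns the tuple keeps it; all others silently drop it. */
    return target_segment == my_segment;
}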

Typical TupleMatchesHashFilter() backtrace

#0  TupleMatchesHashFilter (resultSlot=, resultNode=) at nodeResult.c:232
#1  ExecResult (node=node@entry=0x55f8676767f0) at nodeResult.c:233
#2  0x000055f8653f2f56 in ExecProcNode (node=node@entry=0x55f8676767f0) at execProcnode.c:929
#3  0x000055f8653ef962 in ExecutePlan (estate=0x55f867676128, planstate=0x55f8676767f0, operation=CMD_INSERT, numberTuples=0, direction=, 
    dest=0x55f8676800d8) at execMain.c:2900
#4  0x000055f8653f00e6 in ExecutorRun (queryDesc=queryDesc@entry=0x55f86798d058, direction=direction@entry=ForwardScanDirection, count=count@entry=0)
    at execMain.c:912
#5  0x000055f8655b9079 in ProcessQuery (portal=portal@entry=0x55f86767b158, stmt=stmt@entry=0x55f86766e1e8, params=, 
    dest=dest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:296
#6  0x000055f8655b96bb in PortalRunMulti (portal=portal@entry=0x55f86767b158, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x55f8676800d8, 
    altdest=altdest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:1467
#7  0x000055f8655baa2e in PortalRun (portal=portal@entry=0x55f86767b158, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
    dest=dest@entry=0x55f8676800d8, altdest=altdest@entry=0x55f8676800d8, completionTag=completionTag@entry=0x7ffd3d107290 "") at pquery.c:1029
#8  0x000055f8655b272e in exec_mpp_query (
    query_string=query_string@entry=0x55f86766c466 "insert into t_test values('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d');", serializedQuerytree=serializedQuerytree@entry=0x0, serializedQuerytreelen=serializedQuerytreelen@entry=0, 
    serializedPlantree=serializedPlantree@entry=0x55f86766c4cc "7\002", serializedPlantreelen=serializedPlantreelen@entry=236, 
    serializedParams=serializedParams@entry=0x0, serializedParamslen=0, serializedQueryDispatchDesc=0x55f86766c5b8 "s", 
    serializedQueryDispatchDesclen=75, seqServerHost=0x55f86766c603 "127.0.1.1", seqServerPort=38837, localSlice=0) at postgres.c:1349
#9  0x000055f8655b7ee8 in PostgresMain (argc=, argv=argv@entry=0x55f867672e30, dbname=, username=)
    at postgres.c:5159
#10 0x000055f86554a19b in BackendRun (port=0x55f8676834b0) at postmaster.c:6732
#11 BackendStartup (port=0x55f8676834b0) at postmaster.c:6406
#12 ServerLoop () at postmaster.c:2444
#13 0x000055f86554bb28 in PostmasterMain (argc=, argv=) at postmaster.c:1528
#14 0x000055f8651d3340 in main (argc=12, argv=0x55f8676499b0) at main.c:206

From the dispatcher's point of view, direct insertion into one target segment obviously requires hash calculation. directDispatchCalculateHash() is the hashing workhorse for direct dispatch to tables with a non-random distribution policy.

Typical directDispatchCalculateHash() backtrace

#0  directDispatchCalculateHash (targetPolicy=0x556f259e1830, plan=0x556f259e0de8) at cdbmutate.c:730
#1  apply_motion (root=root@entry=0x556f259e0738, plan=plan@entry=0x556f259e0de8, query=query@entry=0x556f259e00d0) at cdbmutate.c:730
#2  0x0000556f2416f8af in cdbparallelize (root=0x556f259e0738, plan=plan@entry=0x556f259e0de8, query=query@entry=0x556f259e00d0, 
    cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at cdbllize.c:272
#3  0x0000556f23e93838 in standard_planner (parse=0x556f259e00d0, cursorOptions=0, boundParams=0x0) at planner.c:348
#4  0x0000556f23e93ac5 in planner (parse=parse@entry=0x556f2557d5c0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0)
    at planner.c:163
#5  0x0000556f23f4a013 in pg_plan_query (cursorOptions=0, boundParams=0x0, querytree=0x556f2557d5c0) at postgres.c:914
#6  pg_plan_queries (querytrees=0x556f25649888, boundParams=boundParams@entry=0x0, needSnapshot=0 '\000', cursorOptions=0) at postgres.c:991
#7  0x0000556f23f4c234 in exec_simple_query (
    query_string=query_string@entry=0x556f2557c478 "insert into t_test values('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d');", seqServerHost=seqServerHost@entry=0x0, seqServerPort=seqServerPort@entry=-1) at postgres.c:1704
#8  0x0000556f23f4e55a in PostgresMain (argc=, argv=argv@entry=0x556f252d9c88, dbname=, username=)
    at postgres.c:4975
#9  0x0000556f23ee119b in BackendRun (port=0x556f252e1d10) at postmaster.c:6732
#10 BackendStartup (port=0x556f252e1d10) at postmaster.c:6406
#11 ServerLoop () at postmaster.c:2444
#12 0x0000556f23ee2b28 in PostmasterMain (argc=, argv=) at postmaster.c:1528
#13 0x0000556f23b6a340 in main (argc=15, argv=0x556f252b05f0) at main.c:206

Hash, hash, hash again, and umm… hash.

The two main goals of the directDispatchCalculateHash() function are:

  • Calculate a hash over the whole distribution key using multiple calls to cdbhash().
  • Decide, based on the total hash value, which segment the tuple should go to, using cdbhashreduce().

Let’s take a closer look at how it works.
At the very beginning, the hash structure is initialized with the following values:

  1. The number of segments.
  2. The reduction policy. If the number of segments is a power of 2, the faster bitmask reduction (REDUCE_BITMASK) is used; if not, the slower modulus (REDUCE_LAZYMOD).
  3. A random key for the random distribution policy (not our case).

Additionally, a magic hashing value is applied as the base value for the future hash calculation. This value is fixed and equal to (dec/hex/bin):

2166136261; 0x811c9dc5; 10000001000111001001110111000101
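
Roughly, the initialization described above can be sketched like this (a standalone sketch; the struct and function names are mine, not the real CdbHash API):

#include <stdint.h>

#define FNV1_32_INIT 0x811c9dc5u          /* 2166136261, the magic base value above */

typedef enum { REDUCE_BITMASK, REDUCE_LAZYMOD } ReduceAlg;

typedef struct {
    uint32_t  hash;       /* running hash value                        */
    int       numsegs;    /* number of segments in the cluster         */
    ReduceAlg reducealg;  /* how the final hash is mapped to a segment */
} HashState;

static void hash_init(HashState *h, int numsegs)
{
    h->numsegs   = numsegs;
    /* power of 2 -> fast bitmask reduction, otherwise the slower modulus */
    h->reducealg = (numsegs & (numsegs - 1)) == 0 ? REDUCE_BITMASK : REDUCE_LAZYMOD;
    h->hash      = FNV1_32_INIT;
}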

Right after that, the distribution key hashing loop begins. Usually it calls the cdbhash() function, which calculates a new hash value based on the previously calculated one and on the current Datum value.

The hashDatum() function used by cdbhash() just prepares the input Datum for further processing, normalizing it (to int64 for integers; ignoring trailing blank characters for character types; and so on).
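
The gist of that normalization can be shown with a tiny helper (an illustration only, with a hypothetical name; the real hashDatum() handles many more types):

#include <stddef.h>

/* Hash-friendly length of a fixed-width char value: trailing blanks are ignored,
 * so 'abc ' and 'abc' produce the same bytes and therefore the same hash.
 * Integer types are similarly widened to a common 64-bit form before hashing. */
static size_t
trim_trailing_blanks(const char *value, size_t len)
{
    while (len > 0 && value[len - 1] == ' ')
        len--;
    return len;
}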

After normalization is done, it’s time to calculate the current Datum's hash. The addToCdbHash() function is used for that purpose; it, in turn, calls the fnv1_32_buf() function, which implements the FNV-1 algorithm. Let’s look at this with an example.
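
Before diving into the example, here is the per-byte FNV-1 step in compact form (a standalone sketch of the algorithm described below, not the literal fnv1_32_buf() source):

#include <stdint.h>
#include <stddef.h>

/* FNV-1: for every input byte, first "multiply" the running hash by the 32-bit
 * FNV prime 16777619 (expressed as a sum of shifts), then XOR in the byte. */
static uint32_t
fnv1_32(uint32_t hval, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);
        hval ^= (uint32_t) buf[i];
    }
    return hval;
}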

Imagine inserting the following tuple, whose columns are all part of the distribution key and which consists of two values:

('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d')

And this is how the first value looks in memory (dec/hex):

0x55f61a41bd5c:	49	102	54	54	52	101	100	51
0x55f61a41bd64:	101	101	53	52	97	57	99	55
0x55f61a41bd6c:	51	53	97	97	98	100	101	98
0x55f61a41bd74:	99	52	54	101	101	48	57	54

0x55f61a41bd5c:	0x31	0x66	0x36	0x36	0x34	0x65	0x64	0x33
0x55f61a41bd64:	0x65	0x65	0x35	0x34	0x61	0x39	0x63	0x37
0x55f61a41bd6c:	0x33	0x35	0x61	0x61	0x62	0x64	0x65	0x62
0x55f61a41bd74:	0x63	0x34	0x36	0x65	0x65	0x30	0x39	0x36

Looping over all the bytes of the first value, we first transform the hash value based on itself only, using the following formula:

hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);

That is, the hash value (or the base value on the first iteration) plus its copies left-shifted by 1, 4, 7, 8 and 24 bits.

(hval << 1): 37305226; 0x2393b8a; 10001110010011101110001010
(hval << 4): 298441808; 0x11c9dc50; 10001110010011101110001010000
(hval << 7): 2387534464; 0x8e4ee280; 10001110010011101110001010000000
(hval << 8): 480101632; 0x1c9dc500; 11100100111011100010100000000
(hval << 24): 3305111552; 0xc5000000; 11000101000000000000000000000000

Which is 8674630943 in total, which is more than twice the maximum uint32 value (4294967295), so it wraps around twice:

8674630943 - (4294967295 + 1) - (4294967295 + 1) = 84696351 #new hval
84696351; 0x50c5d1f; 101000011000101110100011111

The next line of the hashing algorithm is where the current Datum starts to be taken into account. A bitwise XOR with the first byte of the Datum (49 as uint32; '1' as char) transforms hval into a new value:

84696366; 0x50c5d2e; 101000011000101110100101110
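
You can reproduce this first iteration with plain unsigned 32-bit arithmetic; the manual subtractions of 4294967296 above are just the uint32 overflow wrap happening implicitly:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint32_t hval = 2166136261u;   /* 0x811c9dc5, the base value */

    /* shift-and-add step: the additions wrap modulo 2^32 automatically */
    hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);
    printf("%" PRIu32 "\n", hval); /* prints 84696351 */

    hval ^= (uint32_t) '1';        /* first byte of the first value */
    printf("%" PRIu32 "\n", hval); /* prints 84696366 */
    return 0;
}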

After 32 iterations (the length of the first Datum) our hash looks like this:

3465832814; 0xce94696e; 11001110100101000110100101101110

For the next Datum, we start using this value as the base for its hash calculation.

And here is how the second value looks in memory (dec/hex):

0x55f61a41bf2c:	100	49	55	102	100	101	99	49
0x55f61a41bf34:	45	97	99	57	52	45	52	53
0x55f61a41bf3c:	55	50	45	97	56	52	52	45
0x55f61a41bf44:	57	100	53	52	102	50	49	97
0x55f61a41bf4c:	48	56	49	100

0x55f61a41bf2c:	0x64	0x31	0x37	0x66	0x64	0x65	0x63	0x31
0x55f61a41bf34:	0x2d	0x61	0x63	0x39	0x34	0x2d	0x34	0x35
0x55f61a41bf3c:	0x37	0x32	0x2d	0x61	0x38	0x34	0x34	0x2d
0x55f61a41bf44:	0x39	0x64	0x35	0x34	0x66	0x32	0x31	0x61
0x55f61a41bf4c:	0x30	0x38	0x31	0x64

At the 36th iteration (the second value is 36 bytes long) we perform the final calculation.

The left-shifted values of hval at this point (hval has already absorbed the preceding bytes of the second value):

(hval << 1): 3348131412; 0xc7906e54; 11000111100100000110111001010100
(hval << 4): 1015247520; 0x3c8372a0; 111100100000110111001010100000
(hval << 7): 3827012864; 0xe41b9500; 11100100000110111001010100000000
(hval << 8): 3359058432; 0xc8372a00; 11001000001101110010101000000000
(hval << 24): 704643072; 0x2a000000; 101010000000000000000000000000

The new hval is calculated using the left-shifted values:

13928159006 - (4294967295 + 1) - (4294967295 + 1) - (4294967295 + 1) = 1043257118
1043257118; 0x3e2ed71e; 111110001011101101011100011110

And finally, XOR with the last character of the second value (100 as uint32; 'd' as char).

1043257210; 0x3e2ed77a; 111110001011101101011101111010

This is our total hash value.

The last step is to calculate the target segment number based on it. If the total number of segments in the cluster is a power of 2, we do a bitwise AND of the hash and the number of segments minus one. Otherwise we do a simple modulus, which, in the three-segment cluster case, gives the following result:

(h->hash) % (h->numsegs) = (1043257210) % (3) = 1

That’s it. The tuple will go to segment 1 (the second segment, counting from 0).
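
The reduction step itself fits in a few lines (an illustration of the rule just described, not the literal cdbhashreduce() source; the function name is mine):

#include <stdint.h>

/* Map the final 32-bit hash onto a segment number. */
static int
reduce_to_segment(uint32_t hash, int numsegs)
{
    if ((numsegs & (numsegs - 1)) == 0)                 /* power of 2: bitmask      */
        return (int) (hash & (uint32_t) (numsegs - 1));
    return (int) (hash % (uint32_t) numsegs);           /* otherwise: plain modulus */
}

/* reduce_to_segment(1043257210u, 3) == 1, i.e. segment 1. */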

The Good, the Bad and the Modulus.

Despite the fact that the FNV-1 function is quite strong and works well, the “Modulus” can put a spoke in your wheel.

Imagine we have the following set of tuples.

('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d')
('704ac2be373dafe8609dff66e23f325e','d17fdec1-ac94-4572-a844-9d54f21a081d')
('37f824e98ffe07d2b4e232fbb6006c8f','d17fdec1-ac94-4572-a844-9d54f21a081d')
('8bf56d72a0af20a4f00896f102a589da','d17fdec1-ac94-4572-a844-9d54f21a081d')
('a7088ec9c477bda70be15bfae1daf3c2','d17fdec1-ac94-4572-a844-9d54f21a081d')

That’s right, they all have the same second value, but this should not be a problem for a “Good” hash function.

Here are the hash values calculated for them.

1043257210
2813991850
151371370
1588923970
2762693290

They are different. Yes, they may look similar because they are all divisible by 10, but they are different.

Let’s calculate the target segment numbers. As mentioned above, there are two ways of calculating the segment number from a hash, depending on whether the total segment count is a power of 2 or not.

Hash values are on the left, total segment counts are across the top, and target segment numbers are in the middle. (For the power-of-2 counts the numbers would be the same if calculated with modulus too, but I’ll keep the tables separate.)

Non-power-of-2 segment counts (modulus reduction):

segments:     3   5   6   7   9  10  11  12  13  14  15
1043257210    1   0   4   2   7   0   6  10   8   2  10
2813991850    1   0   4   5   1   0  10  10   7  12  10
151371370     1   0   4   3   1   0   7  10   7  10  10
1588923970    1   0   4   4   7   0   7  10  10   4  10
2762693290    1   0   4   0   1   0   5  10   6   0  10

Power-of-2 segment counts (bitmask reduction):

segments:     2   4   8  16
1043257210    0   2   2  10
2813991850    0   2   2  10
151371370     0   2   2  10
1588923970    0   2   2   2
2762693290    0   2   2  10
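
If you want to reproduce the tables above, a short standalone snippet like this prints the same numbers (modulus columns first, then the power-of-2 bitmask columns):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint32_t hashes[] = {1043257210u, 2813991850u, 151371370u, 1588923970u, 2762693290u};
    int mod_segs[]    = {3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15}; /* modulus reduction */
    int pow2_segs[]   = {2, 4, 8, 16};                           /* bitmask reduction */

    for (int i = 0; i < 5; i++)
    {
        printf("%10" PRIu32 ":", hashes[i]);
        for (int j = 0; j < 11; j++)
            printf(" %2" PRIu32, hashes[i] % (uint32_t) mod_segs[j]);
        printf(" |");
        for (int j = 0; j < 4; j++)
            printf(" %2" PRIu32, hashes[i] & (uint32_t) (pow2_segs[j] - 1));
        printf("\n");
    }
    return 0;
}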

Whoopsie. For many total segment counts (across the top), the target segment number (in the middle) is the same no matter which hash value (on the left) was used. A set of such “Bad”, or call them “unlucky”, tuples can ruin the whole distribution.

To tell the truth, your application is unlikely to generate only such “unlucky” values. Moreover, changing just the last character of a value (1f664ed3ee54a9c735aabdebc46ee096 to 1f664ed3ee54a9c735aabdebc46ee097) or removing any character fixes the problem.

BUT. If your application really does generate only such values (IDK how, please don’t ask), I can advise the following.

  • Use another distribution key.
  • Use a wider distribution key with more columns.
  • Use a higher number of segments.
  • Use an odd or, better, a prime (7+) number of segments.

Back to the future. A few words about 6x.

6x with set gp_use_legacy_hashops=on; inherits 5x’s behavior, with some differences in the stack trace for directDispatchCalculateHash().

#0  directDispatchCalculateHash (hashfuncs=0x55def7ac7028, targetPolicy=0x55def7ac6980, plan=0x55def7bf1df8) at cdbmutate.c:149
#1  sri_optimize_for_result (root=root@entry=0x55def7bf1968, plan=plan@entry=0x55def7bf1df8, rte=rte@entry=0x55def7bf17b8, targetPolicy=targetPolicy@entry=0x7ffe079ed850, 
    hashExprs_p=hashExprs_p@entry=0x7ffe079ed840, hashOpfamilies_p=hashOpfamilies_p@entry=0x7ffe079ed848) at cdbmutate.c:3576
#2  0x000055def046501d in adjust_modifytable_flow (is_split_updates=, node=, root=0x55def7bf1968) at createplan.c:6629
#3  make_modifytable (root=root@entry=0x55def7bf1968, operation=, canSetTag=, resultRelations=, 
    subplans=subplans@entry=0x55def7ac68e8, withCheckOptionLists=withCheckOptionLists@entry=0x0, returningLists=, is_split_updates=, 
    rowMarks=, epqParam=) at createplan.c:6492
#4  0x000055def0473771 in subquery_planner (glob=glob@entry=0x55def79cded8, parse=parse@entry=0x55def7bf1480, parent_root=parent_root@entry=0x0, 
    hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=tuple_fraction@entry=0, subroot=subroot@entry=0x7ffe079ed948, config=) at planner.c:907
#5  0x000055def0473ad8 in standard_planner (parse=0x55def7bf1480, cursorOptions=0, boundParams=0x0) at planner.c:345
#6  0x000055def0474425 in planner (parse=parse@entry=0x55def79cdff0, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at planner.c:200
#7  0x000055def054a0f6 in pg_plan_query (boundParams=, cursorOptions=, querytree=) at postgres.c:971
#8  pg_plan_queries (querytrees=, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at postgres.c:1030
#9  0x000055def054aec4 in exec_simple_query (
    query_string=0x55def79ccfd8 "insert into t_test values('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d');") at postgres.c:1766
#10 0x000055def054d5b9 in PostgresMain (argc=, argv=argv@entry=0x55def79ac720, dbname=, username=) at postgres.c:5290
#11 0x000055def04c048d in BackendRun (port=0x55def79d52e0) at postmaster.c:4811
#12 BackendStartup (port=0x55def79d52e0) at postmaster.c:4468
#13 ServerLoop () at postmaster.c:1948
#14 0x000055def04c14be in PostmasterMain (argc=6, argv=) at postmaster.c:1518
#15 0x000055def0114725 in main (argc=6, argv=0x55def79aa620) at main.c:245

And for TupleMatchesHashFilter().

#0  cdbhashreduce (h=0x557fbddd5380) at cdbhash.c:242
#1  0x0000557fb5a2f8fc in TupleMatchesHashFilter (node=0x557fbde94748, node=0x557fbde94748, resultSlot=0x557fbde93568) at nodeResult.c:279
#2  ExecResult (node=node@entry=0x557fbde94748) at nodeResult.c:229
#3  0x0000557fb59fefb8 in ExecProcNode (node=node@entry=0x557fbde94748) at execProcnode.c:970
#4  0x0000557fb5a2adfd in ExecModifyTable (node=node@entry=0x557fbde942f0) at nodeModifyTable.c:1745
#5  0x0000557fb59fefd8 in ExecProcNode (node=node@entry=0x557fbde942f0) at execProcnode.c:974
#6  0x0000557fb59f5201 in ExecutePlan (estate=estate@entry=0x557fbde92fd8, planstate=0x557fbde942f0, operation=operation@entry=CMD_INSERT, 
    sendTuples=sendTuples@entry=0 '\000', numberTuples=numberTuples@entry=0, direction=direction@entry=ForwardScanDirection, dest=0x557fbddb78d0) at execMain.c:2956
#7  0x0000557fb59f5b5e in ExecutePlan (dest=0x557fbddb78d0, direction=ForwardScanDirection, numberTuples=0, sendTuples=0 '\000', operation=CMD_INSERT, 
    planstate=, estate=0x557fbde92fd8) at execMain.c:2922
#8  standard_ExecutorRun (queryDesc=0x557fbde85848, direction=ForwardScanDirection, count=0) at execMain.c:983
#9  0x0000557fb5beda59 in ProcessQuery (portal=portal@entry=0x557fbde96ff8, stmt=stmt@entry=0x557fbddb6a58, params=, dest=dest@entry=0x557fbddb78d0, 
    completionTag=completionTag@entry=0x7fff415acef0 "", sourceText=) at pquery.c:288
#10 0x0000557fb5bedfc5 in PortalRunMulti (portal=portal@entry=0x557fbde96ff8, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x557fbddb78d0, 
    altdest=altdest@entry=0x557fbddb78d0, completionTag=completionTag@entry=0x7fff415acef0 "") at pquery.c:1473
#11 0x0000557fb5bef522 in PortalRun (portal=portal@entry=0x557fbde96ff8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
    dest=dest@entry=0x557fbddb78d0, altdest=altdest@entry=0x557fbddb78d0, completionTag=completionTag@entry=0x7fff415acef0 "") at pquery.c:1018
#12 0x0000557fb5be7ef3 in exec_mpp_query (
    query_string=0x557fbddb61d0 "insert into t_test values('1f664ed3ee54a9c735aabdebc46ee096','d17fdec1-ac94-4572-a844-9d54f21a081d');", 
    serializedQuerytree=, serializedQuerytreelen=, serializedPlantree=, serializedPlantreelen=, 
    serializedParams=, serializedParamslen=28, serializedQueryDispatchDesc=0x557fbddb638f "(\265/\375 {\245\002", serializedQueryDispatchDesclen=93)
    at postgres.c:1383
#13 0x0000557fb5bec646 in PostgresMain (argc=, argv=argv@entry=0x557fbdd95a70, dbname=, username=) at postgres.c:5436
#14 0x0000557fb5b5f48d in BackendRun (port=0x557fbddbe5e0) at postmaster.c:4811
#15 BackendStartup (port=0x557fbddbe5e0) at postmaster.c:4468
#16 ServerLoop () at postmaster.c:1948
#17 0x0000557fb5b604be in PostmasterMain (argc=5, argv=) at postmaster.c:1518
#18 0x0000557fb57b3725 in main (argc=5, argv=0x557fbdd939e0) at main.c:245

ORCA still can’t dispatch an insert to the target segment only, and all four conditions for such a dispatch are still the same.

As for jump hashing, the main differences are in a few places in the hashing code, including the Datum hashing algorithms themselves, but generally the backtrace is the same.
