Multicolumn Partitioning


First published here.

Multi-column partitioning allows us to specify more than one column as the partition key. Currently, multi-column partitioning is possible only for range and hash partitioning. Range partitioning was introduced in PostgreSQL 10 and hash partitioning was added in PostgreSQL 11.

Creating Partitions

To create a multi-column partition, when defining the partition key in the CREATE TABLE command, state the columns as a comma-separated list. You can specify a maximum of 32 columns.
CREATE TABLE tbl_range (id int, col1 int, col2 int, col3 int)
 PARTITION BY RANGE (col1, col2, col3);
CREATE TABLE tbl_hash (id int, col1 int, col2 int, col3 int)
 PARTITION BY HASH (col1, col2, col3);

Range

When we mention the partition bounds for a partition of a multicolumn range partitioned table, we need to specify the bound for each of the columns of the partition key in the CREATE TABLE ... PARTITION OF or the ALTER TABLE ... ATTACH PARTITION command.
CREATE TABLE p1 PARTITION OF tbl_range
 FOR VALUES FROM (1, 110, 50) TO (20, 200, 200);
ALTER TABLE tbl_range ATTACH PARTITION r1
 FOR VALUES FROM (1, 110, 50) TO (20, 200, 200);
The tuple routing section explains how these bounds work for the partition.
Please note that if an unbounded value -- MINVALUE or MAXVALUE -- is used for one of the columns, then all subsequent columns must also use the same unbounded value.
CREATE TABLE r2 PARTITION OF tbl_range 
 FOR VALUES FROM (900, MINVALUE, MINVALUE) TO (1020, 200, 200);
ALTER TABLE tbl_range ATTACH PARTITION r3
 FOR VALUES FROM (1, 110, 50) TO (MAXVALUE, MAXVALUE, MAXVALUE);

Hash

When we mention the partition bounds for a partition of a multicolumn hash partitioned table, we need to specify only one bound (a modulus/remainder pair) irrespective of the number of columns used.
CREATE TABLE p1 PARTITION OF tbl_hash
 FOR VALUES WITH (MODULUS 100, REMAINDER 20);
ALTER TABLE tbl_hash ATTACH PARTITION h1
 FOR VALUES WITH (MODULUS 100, REMAINDER 20)

Tuple Routing

The partitioned parent table does not store any rows itself but routes every inserted row to one of the partitions based on the value of the partition key. This section explains how tuple routing takes place for range and hash multi-column partition keys.

Range

In a range partitioned table, the lower bound is included in the partition but the upper bound is excluded. In a partition with bounds 0 to 100, rows with partition key value 0 are accepted but rows with value 100 are not.

For a multi-column range partition, the row comparison operator is used for tuple routing, which means the columns are compared left to right, stopping at the first unequal pair. If a column's value is equal to that column's upper bound, then the next column is considered.

Consider a partition with bounds (0, 0) to (100, 50). It accepts a row with partition key value (0, 100): the first column's value 0 is strictly within the first column's bounds 0 to 100, so the second column is not considered at all (even though 100 exceeds 50).

The partition key value (100, 49) is also accepted: the first column's value equals the specified upper bound, so the second column is considered, and 49 satisfies the restriction 0 to 50.

By the same logic, rows with values (100, 50) or (101, 10) are not accepted in this partition.
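The left-to-right row comparison described above can be sketched in a few lines of Python (a simplified model for illustration, not PostgreSQL's actual implementation; Python's tuple comparison happens to be exactly this lexicographic comparison):

```python
def in_range_partition(key, lower, upper):
    """Return True if tuple `key` falls in the partition [lower, upper)
    under row comparison semantics: columns are compared left to right,
    stopping at the first unequal pair; the lower bound is inclusive
    and the upper bound is exclusive."""
    return lower <= key < upper

# Partition with bounds (0, 0) to (100, 50):
print(in_range_partition((0, 100), (0, 0), (100, 50)))   # True: 0 < 100, second column ignored
print(in_range_partition((100, 49), (0, 0), (100, 50)))  # True: 100 == 100, so 49 < 50 decides
print(in_range_partition((100, 50), (0, 0), (100, 50)))  # False: equals the upper bound
print(in_range_partition((101, 10), (0, 0), (100, 50)))  # False: 101 > 100
```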

Note that if any partition key column value is NULL, the row can only be routed to the default partition if one exists; otherwise an error is raised.

Hash

In the hash partitioned case, a hash is computed for each column value in the partition key, and these are combined into a single 64-bit hash value. The modulus operation is applied to this hash value, and the remainder determines the partition for the inserted row.
There is no special handling for NULL values; the hash is generated and combined as explained above to find the partition for the row to be inserted.
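This hash-then-combine scheme can be modeled in Python. The combine step below is a made-up stand-in (PostgreSQL uses its own per-type hash functions and a hash_combine64 step, which differ from this), so only the shape of the computation matches:

```python
MASK = (1 << 64) - 1  # keep everything in 64 bits

def combine(a, b):
    # Simplified stand-in for PostgreSQL's 64-bit hash combining;
    # the real per-column hash functions and combine step differ.
    return (a ^ (b + 0x9E3779B97F4A7C15 + ((a << 6) & MASK) + (a >> 2))) & MASK

def route_hash(key, modulus):
    """Toy model: hash each partition key column, combine the hashes
    into one 64-bit value, and take the remainder to pick a partition."""
    h = 0
    for col in key:
        # NULL values get no special handling; modeled here as hash 0.
        col_hash = 0 if col is None else hash(col) & MASK
        h = combine(h, col_hash)
    return h % modulus  # remainder selects the partition
```

Because every column participates in the single combined hash, knowing only a subset of the columns tells you nothing about the remainder, which is why pruning needs all partition key columns (see the Partition Pruning section).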

Partition Pruning

One of the main reasons to use partitioning is the performance improvement achieved by partition pruning. Pruning in a multi-column partitioned table has a few restrictions, which are explained below.
For simplicity, all examples in this section show plan-time pruning using constants. The same pruning capability applies in other cases where pruning is feasible, such as runtime pruning and partition-wise aggregation.

Query using all the partition key columns

When the query uses all the partition key columns in its WHERE clause or JOIN clause, partition pruning is possible.
Consider the following multi-column range partitioned table.
                           Partitioned table "public.tbl_range"

 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           |          |         | plain   |              | 
 col1   | integer |           |          |         | plain   |              | 
 col2   | integer |           |          |         | plain   |              | 
 col3   | integer |           |          |         | plain   |              | 

Partition key: RANGE (col1, col2, col3)

Partitions: r1 FOR VALUES FROM (MINVALUE, MINVALUE, MINVALUE) TO 
                               (1000, 2000, 3000),
            r2 FOR VALUES FROM (1000, 2000, 3000) TO
                               (5000, 6000, 7000),
            r3 FOR VALUES FROM (5000, 6000, 7000) TO
                               (10000, 11000, 12000),
            r4 FOR VALUES FROM (10000, 11000, 12000) TO
                               (15000, 16000, 17000),
            r5 FOR VALUES FROM (15000, 16000, 17000) TO
                               (MAXVALUE, MAXVALUE, MAXVALUE)
The following two queries show partition pruning when using all the columns in the partition key.
postgres=# EXPLAIN SELECT * FROM tbl_range WHERE col1 = 5000 
           AND col2 = 12000 AND col3 = 14000;

                           QUERY PLAN                            
-----------------------------------------------------------------
 Seq Scan on r3 tbl_range  (cost=0.00..230.00 rows=1 width=16)
   Filter: ((col1 = 5000) AND (col2 = 12000) AND (col3 = 14000))

(2 rows)


postgres=# EXPLAIN SELECT * FROM tbl_range WHERE col1 < 5000
                               AND col2 = 12000 AND col3 = 14000;

                              QUERY PLAN                               
-----------------------------------------------------------------------
 Append  (cost=0.00..229.99 rows=2 width=16)
   -> Seq Scan on r1 tbl_range_1  (cost=0.00..45.98 rows=1 width=16)
       Filter: ((col1 < 5000) AND (col2 = 12000) AND (col3 = 14000))
   -> Seq Scan on r2 tbl_range_2  (cost=0.00..184.00 rows=1 width=16)
       Filter: ((col1 < 5000) AND (col2 = 12000) AND (col3 = 14000))
(5 rows)
Similarly, for a hash partitioned table with multiple columns in its partition key, partition pruning is possible when all partition key columns are used in the query.

Consider the following multi-column hash partitioned table.
                           Partitioned table "public.tbl_hash"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 id     | integer |           |          |         | plain   |              | 
 col1   | integer |           |          |         | plain   |              | 
 col2   | integer |           |          |         | plain   |              | 
 col3   | integer |           |          |         | plain   |              | 
Partition key: HASH (col1, col2, col3)
Partitions: h1 FOR VALUES WITH (modulus 5, remainder 0),
            h2 FOR VALUES WITH (modulus 5, remainder 1),
            h3 FOR VALUES WITH (modulus 5, remainder 2),
            h4 FOR VALUES WITH (modulus 5, remainder 3),
            h5 FOR VALUES WITH (modulus 5, remainder 4)
Query:
postgres=# EXPLAIN SELECT * FROM tbl_hash WHERE col1 = 5000 AND col2 = 12000 AND col3 = 14000;

                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Gather  (cost=1000.00..7285.05 rows=1 width=16)
   Workers Planned: 1
   -> Parallel Seq Scan on h4 tbl_hash  (cost=0.00..6284.95 rows=1 width=16)
        Filter: ((col1 = 5000) AND (col2 = 12000) AND (col3 = 14000))
(4 rows)
Unlike the range partitioned case, only equality operators support partition pruning; the < and > operators scan all the partitions because of the way tuples are distributed in a hash-partitioned table.

Queries using a set of partition key columns

Since a multi-column hash partition uses a single combined hash value, partition pruning is not possible when a query uses only a subset of the partition key columns.

For a range multi-column partition, however, if the query uses a leading subset of the partition key columns, partition pruning is still feasible. The tbl_range table described above is used here as well.

The query below only uses the first two out of the three partition key columns.
postgres=# EXPLAIN SELECT * FROM tbl_range WHERE col1 = 5000 AND col2 = 12000;

                          QUERY PLAN                           
---------------------------------------------------------------
 Seq Scan on r3 tbl_range  (cost=0.00..205.00 rows=1 width=16)
   Filter: ((col1 = 5000) AND (col2 = 12000))
(2 rows)
The query below uses only the first partition key column.
postgres=# EXPLAIN SELECT * FROM tbl_range WHERE col1 < 2000;

                                QUERY PLAN                                
--------------------------------------------------------------------------
 Append  (cost=0.00..199.97 rows=3997 width=16)
   ->  Seq Scan on r1 tbl_range_1  (cost=0.00..35.99 rows=1999 width=16)
         Filter: (col1 < 2000)
   ->  Seq Scan on r2 tbl_range_2  (cost=0.00..144.00 rows=1998 width=16)
         Filter: (col1 < 2000)
(5 rows)

Conclusion

To choose the columns for a multi-column partition key, look for columns that are frequently used together in queries. For range partitioning, order the columns from the most frequently queried to the least frequently queried to benefit from partition pruning in most cases. Column order does not matter for hash partitioning, since it does not support pruning for a subset of the partition key columns.



Using PQtrace

To enable PQtrace, add the following code to the client-side source, in the function that establishes the connection with the server.

FILE *trace_file;
.
.
.
conn = PQconnectdb(conninfo);             /* establish the connection */
trace_file = fopen("/tmp/trace.out", "w");
if (trace_file != NULL)
    PQtrace(conn, trace_file);            /* start tracing all messages */
.
.
.
PQuntrace(conn);                          /* stop tracing */
fclose(trace_file);
return;

First, declare the file variable; then, just after the connection is established on the client side (by PQconnectdb), open the file with write permission and start the trace. Do not forget to close the file before you return from the function where you added this code.

From the file specified, we can get all the messages exchanged between the client and the server.

If you need to further debug the source of the messages being passed, run the client command under gdb with a breakpoint at PQconnectdb, where it connects to the server. When the process breaks, attach another gdb session to the backend process created for the connection and set breakpoints on the libpq functions.

On the client side, put breakpoints on the following:
b pqPutc
b pqPuts
b pqPutnchar
b pqPutInt
b pqPutMsgStart
b pqPutMsgEnd

On the server side, put a breakpoint on the following:
b socket_putmessage

Now continue, and you can monitor step by step how the messages are passed from both sides as the breakpoints above are hit.

Investigating bulk load operation in partitioned tables

This blog is published on the EDB website.

pgbench partitions pgbench_accounts, its largest table, and uses the bulk-load command COPY to populate it. The time taken to run COPY on the pgbench_accounts table is logged separately. This blog explores how this operation is affected by table partitioning.

How to benchmark partition table performance

This blog is published on EDB website.

This blog describes the new pgbench options that partition the default pgbench table pgbench_accounts, and discusses the outcome of OLTP point queries and range queries for the two partition types, range and hash, across various data sizes and partition counts.



PostgreSQL : Test Coverage

Install lcov

Install Dependencies:
yum install perl-devel
yum install perl-Digest-MD5
yum install perl-GD

Download and install lcov
rpm -U lcov-1.13-1.el7.noarch.rpm


Run Test

Configure and make
Use the --enable-coverage configure flag
./configure --enable-coverage
make -j 4

Run make check
cd src/
make check -i

A file with a .gcno extension is created for each source file at build time, and another with a .gcda extension is generated when we run the tests.


Check Coverage

HTML output

make coverage-html

A folder named 'coverage' is generated, containing index.html and the other data required to display the coverage information. The HTML pages show a coverage summary for each folder, and recursively for each file and then for each line.


Text output

make coverage

A .gcov and a .gcov.out file are created for each source file, containing the coverage information.


Reset

make coverage-clean

This resets the execution counts by removing all the generated .gcda files.


Output files

<file>.gcov.out

This lists the details for each function in the corresponding source file. An example output for a function is shown below:
Function 'heap_sync'
Lines executed:100.00% of 10
Branches executed:100.00% of 4
Taken at least once:75.00% of 4
Calls executed:100.00% of 6

<file>.gcov

This displays the original file in full, along with the line number and the number of times each line was executed during the test run. Lines that were never executed are marked with '#####', and '-' indicates that the line is not executable.
        -: 9258:    /* main heap */
       50: 9259:    FlushRelationBuffers(rel);
call    0 returned 100%

.
. <more lines>
.

    #####: 9283:    Page        page = (Page) pagedata;
        -: 9284:    OffsetNumber off;
        -: 9285:
    #####: 9286:    mask_page_lsn_and_checksum(page);
call    0 never executed

index.html

The home page:
This lists all the subdirectories along with their coverage data.


Per-directory info:
Clicking a particular directory shows the coverage information for each file in the selected directory.

Select a file:
This gives the per-line hit count of the selected file. Lines highlighted in blue were hit and those in red were never executed during the test run.

Postgres Crash: Segmentation Fault

Sometimes the postgres server crashes while running a command, and in this blog we shall see how to check whether the crash was caused by a segmentation fault.

Problem:

The server crashed while I was running a command.

 server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 The connection to the server was lost.
 Attempting reset: Failed.!>

The postgres logfile showed:
LOG: server process (PID 2779) was terminated by signal 11: Segmentation fault

Debug:

Attach gdb to the generated core dump and it will show the location that raised the segmentation fault. Here core.2779 is the name of my core dump file.

$ gdb postgres core.2779
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: emerson postgres [local] CREATE INDEX '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000059487a in function (arguments) at file_name.c:527

527 bool hasnulls = TupleHasNulls(tuple);

From here we can determine what has caused the error.

Postgres Crash: OOM error debug

Sometimes the postgres server crashes while running a command, and in this blog we shall see how to check whether the crash was caused by an OOM (Out of Memory) error.

Problem:

The server crashed while I was running a command.

server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
psql: FATAL:  the database system is in recovery mode

The postgres logfile showed:
2019-02-19 17:34:12.074 IST [24391] LOG: server process (PID 24403) was terminated by signal 9: Killed

dmesg revealed that the process was killed because of OOM error:
$ dmesg
.
.
Out of memory: Kill process 24403 (postgres) score 832 or sacrifice child
[20631.325314] Killed process 24403 (postgres) total-vm:5252708kB, anon-rss:1605692kB, file-rss:0kB, shmem-rss:940kB


Debug:

Open a new psql session and get the backend process id.
postgres=# SELECT pg_backend_pid();
 pg_backend_pid 
----------------
           5379
(1 row)
Attach gdb to the process, set a breakpoint at AllocSetAlloc, and ignore the first 99999 hits of that breakpoint.
$ gdb -p 5379
(gdb) b AllocSetAlloc
Breakpoint 1 at 0xab6f49: file aset.c, line 716.
(gdb) ignore 1 99999
Will ignore next 99999 crossings of breakpoint 1.

Run the command that caused the crash and when it breaks in gdb, call MemoryContextStats.
(gdb) call MemoryContextStats(TopMemoryContext)
The output of MemoryContextStats is seen in the server logfile. A snippet is shown below:
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
  PortalContext: 5102824 total in 626 blocks; 64424 free (626 chunks); 5038400 used:
    ExecutorState: 8192 total in 1 blocks; 7152 free (0 chunks); 1040 used
      ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
      TupleSort main: 32832 total in 2 blocks; 6800 free (1 chunks); 26032 used
        Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
      TupleSort main: 1581120 total in 2 blocks; 6800 free (8 chunks); 1574320 used
        Caller tuples: 2097152 total in 9 blocks; 783776 free (2 chunks); 1313376 used
As seen, the PortalContext seems to be filling up. First check whether the current context where gdb had stopped is PortalContext.
(gdb) p *context
$2 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, 
  methods = 0xd11840, parent = 0x1fa58c0, firstchild = 0x203d120, 
  prevchild = 0x0, nextchild = 0x0, name = 0xd14150 "PortalContext", 
  ident = 0x1fa9400 "", reset_cbs = 0x0}
Since I am already in the intended context, I can simply use the gdb command backtrace to check from where memory is being allocated, and then take the necessary actions, such as using pfree on variables, switching to a temporary context, or resetting the current context.

If the current MemoryContext is different, we can set a conditional breakpoint for the intended context and get the backtrace when gdb halts.
(gdb) break aset.c:717 if $_streq(context->name, "PortalContext")
Breakpoint 2 at 0xab6f47: file aset.c, line 717.

(AllocSetAlloc starts at line 716 in aset.c, so this breakpoint specifies the line just after it.)