How to benchmark partition table performance

This blog is published on the EDB website.

This blog describes the new pgbench options for partitioning the default pgbench table, pgbench_accounts, and discusses the outcome of OLTP point queries and range queries for the two partition types, range and hash, across various data sizes and partition counts.
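For example, assuming a pgbench version that ships these options, a hash-partitioned run might look like the following; the scale factor, partition count and client settings are illustrative only:

# initialize pgbench_accounts split into 100 hash partitions
pgbench -i -s 50 --partitions=100 --partition-method=hash postgres

# run the built-in select-only (point query) workload against it
pgbench -S -c 4 -T 60 postgres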



PostgreSQL: Test Coverage

Install lcov

Install Dependencies:
yum install perl-devel
yum install perl-Digest-MD5
yum install perl-GD

Download and install lcov
rpm -U lcov-1.13-1.el7.noarch.rpm


Run Test

Configure and make
Use the --enable-coverage configure flag
./configure --enable-coverage
make -j 4

Run make check
cd src/
make check -i

A file with the .gcno extension is created for each source file during the build, and a corresponding .gcda file is generated when the tests are run.
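Coverage counts accumulate across whatever tests are run against the instrumented build, so additional suites can be run before generating the report. For example, using the standard PostgreSQL make targets (run whichever fits your setup):

# from the top of the build tree: run everything (core, contrib, isolation, TAP tests)
make check-world

# or run the regression tests against an already-running instrumented server
make installcheck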


Check Coverage

HTML output

make coverage-html

A folder named 'coverage' is generated, containing index.html and the other data required to display the coverage information. The HTML pages show a coverage summary for each folder, then recursively for each file, and finally for each line.


Text output

make coverage

A .gcov and a .gcov.out file are created for each source file, containing the coverage information.


Reset

make coverage-clean

This resets the execution counts by removing all the generated .gcda files.


Output files

<file>.gcov.out

This lists the details for each function in the corresponding source file. An example output for one function is shown below:

Function 'heap_sync'
Lines executed:100.00% of 10
Branches executed:100.00% of 4
Taken at least once:75.00% of 4
Calls executed:100.00% of 6

<file>.gcov

This displays the entire original file, along with the line number and a count of how many times each line was executed during the test run. Lines that were never executed are marked with hashes ('#####'), and '-' indicates that the line is not executable.

        -: 9258:    /* main heap */
       50: 9259:    FlushRelationBuffers(rel);
call    0 returned 100%

        .
        .   <more lines>
        .

    #####: 9283:    Page        page = (Page) pagedata;
        -: 9284:    OffsetNumber off;
        -: 9285:
    #####: 9286:    mask_page_lsn_and_checksum(page);
call    0 never executed

index.html

The home page:
This lists all the subdirectories along with their coverage data.


Per-directory info:
Clicking a particular directory shows the coverage info for each file in that directory.

Select a file:
This shows the per-line hit count for the selected file. Lines highlighted in blue were hit, and those in red were never executed during the test run.

Postgres Crash: Segmentation Fault

Sometimes the postgres server crashes while running a command, and in this blog we shall see how to check whether the crash was caused by a segmentation fault.

Problem:

The server crashed while I was running a command.

 server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 The connection to the server was lost.
 Attempting reset: Failed.!>

The postgres logfile showed:
LOG: server process (PID 2779) was terminated by signal 11: Segmentation fault

Debug:

Attach gdb to the generated core dump and it will show the location that threw the segmentation fault. Here core.2779 is the name of my core dump file.

$ gdb postgres core.2779
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: emerson postgres [local] CREATE INDEX '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000059487a in function (arguments) at file_name.c:527

527 bool hasnulls = TupleHasNulls(tuple);

From here we can determine what has caused the error.
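A few gdb commands that typically help at this point (a sketch of a session; the variable to inspect depends on the actual crashing frame, here the 'tuple' from line 527):

(gdb) bt                 # full call stack leading to the crash
(gdb) frame 0            # select the crashing frame
(gdb) list               # show the surrounding source lines
(gdb) print tuple        # inspect the pointer being dereferenced (a NULL value is a common culprit)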

Postgres Crash: OOM error debug

Sometimes the postgres server crashes while running a command, and in this blog we shall see how to check whether the crash was caused by an OOM (Out of Memory) error.

Problem:

The server crashed while I was running a command.

server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
psql: FATAL:  the database system is in recovery mode

The postgres logfile showed:
2019-02-19 17:34:12.074 IST [24391] LOG: server process (PID 24403) was terminated by signal 9: Killed

dmesg revealed that the process was killed because of an OOM error:

$ dmesg
.
.
Out of memory: Kill process 24403 (postgres) score 832 or sacrifice child
[20631.325314] Killed process 24403 (postgres) total-vm:5252708kB, anon-rss:1605692kB, file-rss:0kB, shmem-rss:940kB


Debug:

Open a new psql session and get the backend process id.
postgres=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
           5379
(1 row)
Attach gdb to the process, set a breakpoint at AllocSetAlloc and ignore the next ~100000 hits of that breakpoint.

$ gdb -p 5379
(gdb) b AllocSetAlloc
Breakpoint 1 at 0xab6f49: file aset.c, line 716.
(gdb) ignore 1 99999
Will ignore next 99999 crossings of breakpoint 1.

Run the command that caused the crash and when it breaks in gdb, call MemoryContextStats.
(gdb) call MemoryContextStats(TopMemoryContext)
The output of MemoryContextStats is seen in the server logfile. A snippet is shown below:
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
  PortalContext: 5102824 total in 626 blocks; 64424 free (626 chunks); 5038400 used:
    ExecutorState: 8192 total in 1 blocks; 7152 free (0 chunks); 1040 used
      ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
      TupleSort main: 32832 total in 2 blocks; 6800 free (1 chunks); 26032 used
        Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
      TupleSort main: 1581120 total in 2 blocks; 6800 free (8 chunks); 1574320 used
        Caller tuples: 2097152 total in 9 blocks; 783776 free (2 chunks); 1313376 used
As seen, the PortalContext seems to be filling up. First check whether the current context where gdb had stopped is PortalContext.
(gdb) p *context
$2 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, methods = 0xd11840, parent = 0x1fa58c0, firstchild = 0x203d120, prevchild = 0x0, nextchild = 0x0, name = 0xd14150 "PortalContext", ident = 0x1fa9400 "", reset_cbs = 0x0}
Since I am already at the intended context, I can simply use the gdb command backtrace to check from where the memory is being allocated, and then take the necessary action, such as using pfree on variables, switching to a temporary context, or resetting the current context.

If the current MemoryContext is different, we can set a conditional breakpoint for the intended context and get the backtrace when gdb halts there.

(gdb) break aset.c:717 if $_streq(context->name, "PortalContext")
Breakpoint 2 at 0xab6f47: file aset.c, line 717.

(AllocSetAlloc starts at line 716 in aset.c, so this breakpoint specifies the line just after it.)

Partition Pruning During Execution

Partitioning in PostgreSQL eases handling large volumes of data. This feature has greatly improved with the introduction of declarative partitioning in PostgreSQL 10, paving the way for better query optimization and execution techniques on partitioned tables. PostgreSQL 11 extended query optimization by enabling partition elimination strategies during query execution. It also added a parameter, enable_partition_pruning, which controls the executor's ability to prune partitions and is on by default.
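The parameter can be inspected and toggled per session, for example:

SHOW enable_partition_pruning;
SET enable_partition_pruning = off;   -- e.g. to compare plans with pruning disabled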

When does the runtime pruning occur?

The first attempt at pruning occurs at the planning stage, for quals that compare the partition key with constants. For parameters whose values are not known at plan time, runtime pruning can be done at two stages of execution: at executor startup (initialization) and during actual execution.

1. Executor Initialization

In some cases, as in the execution of a prepared query, the parameter values (called external params) are known during initialization, so the unwanted sub-plans are never initialized. In this case, the EXPLAIN output will not list the eliminated sub-plans but will only give the number of sub-plans removed.

=# prepare tprt_q1 (int, int, int) as select * from tprt where a between $1 and $2 and b <= $3;
=# explain execute tprt_q1 (25000, 30000, 20000);
 Append  (cost=0.00..2007.54 rows=153 width=8)
   Subplans Removed: 7
   ->  Seq Scan on tprt_a3_b1  (cost=0.00..222.98 rows=17 width=8)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on tprt_a3_b2  (cost=0.00..222.98 rows=17 width=8)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))

The EXPLAIN output states that 7 sub-plans have been removed, which implies that the corresponding partitions were not required and hence were not even initialized.

2. Actual Execution

In other cases, such as subqueries and parameterized nested loop joins, the parameters (called exec params) are only available at the time of actual execution. Here, all the partitions are initialized, and the executor then determines which partitions need to be scanned depending on the parameter values. If a partition turns out not to be required at any point during the run, it is marked "never executed" in the EXPLAIN ANALYZE output.

The following is an example of a parameterized nested loop join between two tables with 5000 rows each. The outer table has values from 2001 to 7000, and the partitioned table has values from 1 to 5000, spread over 5 partitions each holding 1000 values; a sketch of the setup follows.
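The exact schema is not shown in the original post; a minimal sketch consistent with the plans below (table, column and index names taken from the EXPLAIN output, everything else assumed) would be:

CREATE TABLE t1 (col1 int);
INSERT INTO t1 SELECT generate_series(2001, 7000);

CREATE TABLE tp (a int) PARTITION BY RANGE (a);
CREATE TABLE tp_a1 PARTITION OF tp FOR VALUES FROM (1)    TO (1001);
CREATE TABLE tp_a2 PARTITION OF tp FOR VALUES FROM (1001) TO (2001);
CREATE TABLE tp_a3 PARTITION OF tp FOR VALUES FROM (2001) TO (3001);
CREATE TABLE tp_a4 PARTITION OF tp FOR VALUES FROM (3001) TO (4001);
CREATE TABLE tp_a5 PARTITION OF tp FOR VALUES FROM (4001) TO (5001);
INSERT INTO tp SELECT generate_series(1, 5000);

CREATE INDEX tp_a1_idx ON tp_a1 (a);
CREATE INDEX tp_a2_idx ON tp_a2 (a);
CREATE INDEX tp_a3_idx ON tp_a3 (a);
CREATE INDEX tp_a4_idx ON tp_a4 (a);
CREATE INDEX tp_a5_idx ON tp_a5 (a);
ANALYZE t1;
ANALYZE tp;

-- the joining query whose plans are shown below (assumed form; the plans also
-- assume the planner picks a nested loop, e.g. with hash/merge joins disabled)
EXPLAIN ANALYZE SELECT * FROM t1, tp WHERE tp.a = t1.col1;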

EXPLAIN ANALYZE output with enable_partition_pruning = off
Nested Loop (actual rows=3000 loops=1)
  ->  Seq Scan on t1 (actual rows=5000 loops=1)
  ->  Append (actual rows=1 loops=5000)
        ->  Index Scan using tp_a1_idx on tp_a1 (actual rows=0 loops=5000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a2_idx on tp_a2 (actual rows=0 loops=5000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a3_idx on tp_a3 (actual rows=0 loops=5000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a4_idx on tp_a4 (actual rows=0 loops=5000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a5_idx on tp_a5 (actual rows=0 loops=5000)
              Index Cond: (a = t1.col1)
Planning Time: 0.319 ms
Execution Time: 114.823 ms

EXPLAIN ANALYZE output with enable_partition_pruning=on
Nested Loop (actual rows=3000 loops=1)
  ->  Seq Scan on t1 (actual rows=5000 loops=1)
  ->  Append (actual rows=1 loops=5000)
        ->  Index Scan using tp_a1_idx on tp_a1 (never executed)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a2_idx on tp_a2 (never executed)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a3_idx on tp_a3 (actual rows=1 loops=1000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a4_idx on tp_a4 (actual rows=1 loops=1000)
              Index Cond: (a = t1.col1)
        ->  Index Scan using tp_a5_idx on tp_a5 (actual rows=1 loops=1000)
              Index Cond: (a = t1.col1)
Planning Time: 0.384 ms
Execution Time: 36.572 ms

Reasons for Performance Improvement

There is a definite improvement in the performance of queries involving partitioned tables, but the extent of it is determined by the partition key parameters, which control how many partition scans can be skipped.

Considering the nested loop join case above, with pruning disabled, all the partitions are scanned for each of the 5000 values from the outer table t1 (loops=5000). With pruning enabled, only the appropriate partition is scanned for each value from the outer table (loops=1000). On two partitions no scans are performed at all (never executed), since the outer table has no values matching the entries in these partitions (1-2000). Since the number of scans on each partition is reduced substantially, we see an improvement of 67% in execution time, from 115 ms to 37 ms.

When the amount of data in the outer table is doubled to 10000 rows (values 2001-12000), the behavior is similar, except that in the non-pruned case the number of scans made on each partition is 10000 instead of 5000; the difference in performance is even better at 83%, from 239 ms to 40 ms.

Under the Hood

The partitions are internally sorted and stored in increasing order of the values that they can hold. Initially, in V10, the undesirable partitions were eliminated by a tedious linear search in the planner; with V11, this has been updated to a quicker binary search of the list.

To perform a scan on a partitioned table, an Append node is used with the scan on each of the leaf partitions being a sub-plan under it. Each of these sub-plans is indexed and the executor internally accesses them by this index.  

To help the executor select the correct Append sub-plans, a map from each partition to the corresponding sub-plan index is used. The planner first creates this map, accounting for all the partitions it has already pruned. When the executor detects that pruning is possible, it fetches the list of partitions that can satisfy the given param and figures out the corresponding sub-plan indexes from the map. If there is no sub-plan index for a partition, it indicates that the partition has already been pruned at a previous stage (planner or executor initialization).

If pruning has taken place during executor startup, then the map is updated, because the rejected partitions are not initialized, which changes the sub-plan indexes of the retained ones. This is necessary so that the map remains valid for pruning done by the executor later.

Supporting runtime partition pruning is just one of several performance improvements for partitioned tables, and more can be expected in the upcoming versions.


--

This blog is also posted on Postgres Rocks.

Sharing a Folder from Mac to a CentOS VM

Install VMware Tools

  1. Start up the VM.
  2. Go to Virtual Machine -> Install VMware Tools.
  3. A pop-up window appears; click Install to connect the VMware Tools installer CD to the virtual machine.
  4. From the CD, extract VMwareTools-<version>.tar.gz.
  5. Go to the extracted folder.
  6. Run the following command:
./vmware-install.pl -d --clobber-kernel-modules=vmhgfs

Share a Folder

  1. Go to Virtual Machine -> Settings.
  2. In the pop-up window, choose Sharing.
  3. Check the Enable Shared Folders option.
  4. Click [+] and select the folder on the Mac which is to be shared.
  5. The folder will be visible under /mnt/hgfs/ in the guest; see the check below.
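From inside the guest, the mount can be verified, for example (the folder name shown is hypothetical):

$ ls /mnt/hgfs/
MySharedFolder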

PostgreSQL Datatypes

SQL datatypes and internal datatypes

SQL type    Internal type
smallint    int16
int         int32
bigint      int64
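These internal names are the C typedefs used throughout the server source. A minimal sketch of how they surface in a C-language function (illustrative only, not part of the original notes):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(add_small_and_big);

/* SQL: CREATE FUNCTION add_small_and_big(smallint, bigint) RETURNS bigint
        AS 'MODULE_PATHNAME' LANGUAGE C STRICT; */
Datum
add_small_and_big(PG_FUNCTION_ARGS)
{
    int16   a = PG_GETARG_INT16(0);   /* SQL smallint -> internal int16 */
    int64   b = PG_GETARG_INT64(1);   /* SQL bigint   -> internal int64 */

    PG_RETURN_INT64((int64) a + b);   /* result returned as internal int64 (SQL bigint) */
}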