Postgres Crash: OOM error debug

Sometimes we see that the postgres server crashes while running some command and in this blog we shall see how to check if it caused by OOM (Out of Memory) error.

Problem:

The server crashed while I was running a command.

server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
psql: FATAL:  the database system is in recovery mode

The postgres logfile showed:
2019-02-19 17:34:12.074 IST [24391] LOG: server process (PID 24403) was terminated by signal 9: Killed

dmesg revealed that the process was killed because of OOM error:
$ dmesg . . Out of memory: Kill process 24403 (postgres) score 832 or sacrifice child [20631.325314] Killed process 24403 (postgres) total-vm:5252708kB, anon-rss:1605692kB, file-rss:0kB, shmem-rss:940kB


Debug:

Open a new psql session and get the backend process id.
postgres=# SELECT pg_backend_pid(); pg_backend_pid ---------------- 5379 (1 row)
Attach gdb to the process. Set breakpoint at AllocSetAlloc and ignore some 100000 runs on that breakpoint
gdb -p 5379 (gdb) b AllocSetAllocBreakpoint 1 at 0xab6f49: file aset.c, line 716. (gdb) ignore 1 99999 Will ignore next 99999 crossings of breakpoint 1.

Run the command that caused the crash and when it breaks in gdb, call MemoryContextStats.
(gdb) call MemoryContextStats(TopMemoryContext)
The output of MemoryContextStats is seen in the server logfile. A snippet is shown below:
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used PortalContext: 5102824 total in 626 blocks; 64424 free (626 chunks); 5038400 used: ExecutorState: 8192 total in 1 blocks; 7152 free (0 chunks); 1040 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TupleSort main: 32832 total in 2 blocks; 6800 free (1 chunks); 26032 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TupleSort main: 1581120 total in 2 blocks; 6800 free (8 chunks); 1574320 used Caller tuples: 2097152 total in 9 blocks; 783776 free (2 chunks); 1313376 used
As seen, the PortalContext seems to be filling up. First check whether the current context where gdb had stopped is PortalContext.
(gdb) p *context $2 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, methods = 0xd11840 , parent = 0x1fa58c0, firstchild = 0x203d120, prevchild = 0x0, nextchild = 0x0, name = 0xd14150 "PortalContext", ident = 0x1fa9400 "", reset_cbs = 0x0}
Since I am already at the intended context, I can simply use the gdb command backtrace, to check from where memory is been allocated and then take necessary actions like using pfree on variables or switching to a temporary context or reseting the current context, etc.

If there current MemoryContext is different, then we can set a conditional breakpoint for the intended context and when gdb halts get the backtrace.
(gdb) break aset.c:717 if $_streq(context->name, "PortalContext") Breakpoint 2 at 0xab6f47: file aset.c, line 717.

(The AllocSetAlloc starts at line number 716 in aset.c and so this breakpoint specifies the line just after it).

No comments:

Post a Comment