Sometimes we see that the postgres server crashes while running some command and in this blog we shall see how to check if it caused by OOM (Out of Memory) error.
Problem:
The server crashed while I was running a command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: FATAL: the database system is in recovery mode
The postgres logfile showed:
2019-02-19 17:34:12.074 IST [24391] LOG: server process (PID 24403) was terminated by signal 9: Killed
dmesg revealed that the process was killed because of OOM error:
$ dmesg
.
.
Out of memory: Kill process 24403 (postgres) score 832 or sacrifice child
[20631.325314] Killed process 24403 (postgres) total-vm:5252708kB, anon-rss:1605692kB, file-rss:0kB, shmem-rss:940kB
Debug:
Open a new psql session and get the backend process id.
postgres=# SELECT pg_backend_pid();
pg_backend_pid
----------------
5379
(1 row)
Attach gdb to the process. Set breakpoint at AllocSetAlloc and ignore some 100000 runs on that breakpoint
gdb -p 5379
(gdb) b AllocSetAllocBreakpoint 1 at 0xab6f49: file aset.c, line 716.
(gdb) ignore 1 99999
Will ignore next 99999 crossings of breakpoint 1.
Run the command that caused the crash and when it breaks in gdb, call
MemoryContextStats.
(gdb) call MemoryContextStats(TopMemoryContext)
The output of
MemoryContextStats is seen in the server logfile. A snippet is shown below:
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
PortalContext: 5102824 total in 626 blocks; 64424 free (626 chunks); 5038400 used:
ExecutorState: 8192 total in 1 blocks; 7152 free (0 chunks); 1040 used
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
TupleSort main: 32832 total in 2 blocks; 6800 free (1 chunks); 26032 used
Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
TupleSort main: 1581120 total in 2 blocks; 6800 free (8 chunks); 1574320 used
Caller tuples: 2097152 total in 9 blocks; 783776 free (2 chunks); 1313376 used
As seen, the PortalContext seems to be filling up. First check whether the current context where gdb had stopped is PortalContext.
(gdb) p *context
$2 = {type = T_AllocSetContext, isReset = false, allowInCritSection = false, methods = 0xd11840 ,
parent = 0x1fa58c0, firstchild = 0x203d120, prevchild = 0x0, nextchild = 0x0, name = 0xd14150 "PortalContext",
ident = 0x1fa9400 "", reset_cbs = 0x0}
Since I am already at the intended context, I can simply use the gdb command backtrace, to check from where memory is been allocated and then take necessary actions like using pfree on variables or switching to a temporary context or reseting the current context, etc.
If there current MemoryContext is different, then we can set a conditional breakpoint for the intended context and when gdb halts get the backtrace.
(gdb) break aset.c:717 if $_streq(context->name, "PortalContext")
Breakpoint 2 at 0xab6f47: file aset.c, line 717.
(The AllocSetAlloc starts at line number 716 in aset.c and so this breakpoint specifies the line just after it).