crash on armhf (autopkgtest)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
pgpool2 (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned |
Bug Description
The issue seems to be a reproducible segfault on armhf
2018-06-18 07:17:33: pid 21486: WARNING: child process with pid: 21518 was terminated by segmentation fault
2018-06-18 07:17:33: pid 21486: LOG: fork a new child process with pid: 21537
DebCI tests only x86 https:/
The test is also re-reproducible on reruns and leaves a crash file.
Ugly test environment (many preocesses, env and timing dependent) took a while to reproduce in gdb with useful output.
Note: 2018-05-30 10:05:43: pid 6519: LOG: pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 6535 statement: "DROP TABLE IF EXISTS utocommit;" message: "table "utocommit" does not exist, skipping"
2018-05-30 10:05:43: pid 6519: LOG: backend [0]: NOTICE: table "utocommit" does not exist, skipping
Is a red herring, that exists on other arches as well.
Program received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/
85 ../sysdeps/
(gdb) bt
#0 strlen () at ../sysdeps/
#1 0x76af2754 in _IO_vfprintf_
format=
at vfprintf.c:1643
#2 0x76b60ba6 in ___vsnprintf_chk (
s=0x1e40de8 "pool_send_
flags=1, slen=<optimized out>, format=0x1e409e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
args=...) at vsnprintf_chk.c:63
#3 0x004ab7ac in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I thought it would have been an artifact of the apport/crash recovery.
But in live gdb it has the same issue to resolve which call it was.
gdb on arm needs dwarf or is odd, if readable it looks like:
#0 strlen () at ../sysdeps/
#1 0x76b3c754 in _IO_vfprintf_
format=
at vfprintf.c:1643
#2 0x76baaba6 in ___vsnprintf_chk (
s=s@
maxlen=
format=
at vsnprintf_chk.c:63
#3 0x0048e7ac in vsnprintf (__ap=..., __fmt=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
__n=1024,
__s=0x676de8 "pool_send_
at /usr/include/
#4 pvsnprintf (
buf=0x676de8 "pool_send_
fmt=
at utils/psprintf.
#5 0x004bc13c in appendStringInfoVA (str=str@
fmt=
at stringinfo.c:131
#6 0x0049a12e in errmsg (fmt=0x76f5a968 <__stack_chk_guard> "") at ../../src/
#7 0x0047ff60 in per_node_error_log (backend=
query=
prefix=
at protocol/
#8 0x0048a65e in pool_send_and_wait (query_
at context/
#9 0x0047d8c4 in SimpleQuery (frontend=
contents=
#10 0x004814d6 in ProcessFrontend
at protocol/
---Type <return> to continue, or q <return> to quit---
#11 0x00477874 in read_packets_
reset_
cont=
#12 0x00477f9e in pool_process_query (frontend=0x69a178, backend=0x690998, reset_request=
at protocol/
#13 0x00473c1a in do_child (fds=fds@
#14 0x0045b494 in fork_a_child (fds=0x694b90, id=29) at main/pgpool_
#15 0x0045ff84 in PgpoolMain (discard_
#16 0x00459ebc in main (argc=<optimized out>, argv=0x7e980664) at main/main.c:318
Old buffer content, is just the target, so content should not matter
(gdb) p buf
$1 = 0x676de8 "pool_send_
BUffer size 1024.
(gdb) p len
$2 = 1024
There is a guard on buf+1024 so it seems right.
(gdb) p fmt
$3 = 0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\""
5 elements to format.
Maybe some in the valist are broken?
va_list parsing is hard, but we should have the fill stack.
This is from
per_node_error_log in src/protocol/
But re-mangled in errmsg and macro around it.
3139 »···if (pool_extract_
3140 »···{
3141 »···»··
3142 »···»··
3143 »···»···»···»··· prefix, node_id, ntohl(slot->pid), query, message)));
3144 »···}
Frame 7:
It seems that message is a bit odd.
(gdb) p prefix
$24 = 0x4e3220 "pool_send_
(gdb) p node_id
$25 = 0
(gdb) p slot->pid
$26 = 185270272
(gdb) p query
$27 = 0x6a4190 "CREATE TABLE autocommit (a int);"
(gdb) p message
$28 = 0x25000000 <error: Cannot access memory at address 0x25000000>
That is a local variable.
And should have been set by the "pool_extract_
On Frame 6 va_start was called on args, but the content is the same.
As the 250000000 seems wrong we need to check pool_extract_
It runs into this code block failing to get the message as it wants:
4366 if (kind != 'E' && kind != 'N')
4367 {
4368 pool_unread(
4369 »···»··
4370 »···»··
4371 »···»··
4372 »···»··
4373 ereport(ERROR,
4374 (errmsg("unable to extract error message"),
4375 errdetail("invalid message kind \"%c\"",kind)));
4376 }
kind is "'C' in my case.
That triggers PG_CATCH and exits the function with
(gdb) p *message
$24 = 0x25000000 <error: Cannot access memory at address 0x25000000>
The outer code should not use the message in this case:
pool_extract_
And the outer code has
> 0 to pass
But this might be the arch reasons here only hitting armhf.
On fin to the pool_extract_
(gdb) fin
Run till exit from #0 pool_extract_
unread=
0x004e3f24 in per_node_error_log (backend=
query=
prefix=
at protocol/
3139 if (pool_extract_
Value returned is $26 = 255
(gdb) p *message
Cannot access memory at address 0x25000000
(gdb) n
3141 ereport(LOG,
Per Definition of the function retvals should be like:
4337 * Return values are:
4338 * 0: not error or notice message
4339 * 1: succeeded to extract error message
4340 * -1: error)
Other calls already either ignore the retval or check for == 1
3 pool_auth.c pool_do_auth 298 if (pool_extract_
4 pool_auth.c pool_do_auth 315 if (pool_extract_
5 pool_auth.c do_clear_
6 pool_auth.c do_crypt 756 if (pool_extract_
7 pool_auth.c send_password_
8 pool_auth.c pool_read_kind 1220 if (pool_extract_
3 pool_process_
5 pool_proto_
This would avoid the issue here as well.
Only go on if == 1 which means found a message to report.
Related branches
- Andreas Hasenack: Approve
- Canonical Server: Pending requested
- git-ubuntu developers: Pending requested
-
Diff: 79 lines (+58/-0)3 files modifieddebian/changelog (+7/-0)
debian/patches/lp1777418-fix-arm-crash.patch (+50/-0)
debian/patches/series (+1/-0)
I'll test the proposed fix and then report it upstream (if ok).