crash on armhf (autopkgtest)

Bug #1777418 reported by Christian Ehrhardt  on 2018-06-18
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
pgpool2 (Ubuntu)
Undecided
Unassigned

Bug Description

Found in: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-cosmic/cosmic/armhf/p/pgpool2/20180616_121023_27258@/log.gz

The issue seems to be a reproducible segfault on armhf
2018-06-18 07:17:33: pid 21486: WARNING: child process with pid: 21518 was terminated by segmentation fault
2018-06-18 07:17:33: pid 21486: LOG: fork a new child process with pid: 21537

DebCI tests only x86 https://ci.debian.net/packages/p/pgpool2/

The test is also re-reproducible on reruns and leaves a crash file.
Ugly test environment (many preocesses, env and timing dependent) took a while to reproduce in gdb with useful output.

Note: 2018-05-30 10:05:43: pid 6519: LOG: pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 6535 statement: "DROP TABLE IF EXISTS utocommit;" message: "table "utocommit" does not exist, skipping"
2018-05-30 10:05:43: pid 6519: LOG: backend [0]: NOTICE: table "utocommit" does not exist, skipping
Is a red herring, that exists on other arches as well.

Program received signal SIGSEGV, Segmentation fault.
strlen () at ../sysdeps/arm/armv6t2/strlen.S:85
85 ../sysdeps/arm/armv6t2/strlen.S: No such file or directory.
(gdb) bt
#0 strlen () at ../sysdeps/arm/armv6t2/strlen.S:85
#1 0x76af2754 in _IO_vfprintf_internal (s=s@entry=0x7ed43780,
    format=format@entry=0x1e409e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", ap=..., ap@entry=...)
    at vfprintf.c:1643
#2 0x76b60ba6 in ___vsnprintf_chk (
    s=0x1e40de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 817 statement: \"CREATE TABLE autocommit (a int);\" message: \"LE IF EXISTS utocommit;\" message: \"table \"utocommit\" "..., maxlen=<optimized out>,
    flags=1, slen=<optimized out>, format=0x1e409e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
    args=...) at vsnprintf_chk.c:63
#3 0x004ab7ac in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I thought it would have been an artifact of the apport/crash recovery.
But in live gdb it has the same issue to resolve which call it was.

gdb on arm needs dwarf or is odd, if readable it looks like:
#0 strlen () at ../sysdeps/arm/armv6t2/strlen.S:85
#1 0x76b3c754 in _IO_vfprintf_internal (s=s@entry=0x7e974780,
    format=format@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", ap=..., ap@entry=...)
    at vfprintf.c:1643
#2 0x76baaba6 in ___vsnprintf_chk (
    s=s@entry=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"...,
    maxlen=<optimized out>, maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=4294967295,
    format=format@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=args@entry=...)
    at vsnprintf_chk.c:63
#3 0x0048e7ac in vsnprintf (__ap=..., __fmt=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
    __n=1024,
    __s=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"...)
    at /usr/include/arm-linux-gnueabihf/bits/stdio2.h:77
#4 pvsnprintf (
    buf=0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"..., len=len@entry=1024,
    fmt=fmt@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=...)
    at utils/psprintf.c:124
#5 0x004bc13c in appendStringInfoVA (str=str@entry=0x7e9748dc,
    fmt=fmt@entry=0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"", args=..., args@entry=...)
    at stringinfo.c:131
#6 0x0049a12e in errmsg (fmt=0x76f5a968 <__stack_chk_guard> "") at ../../src/utils/error/elog.c:658
#7 0x0047ff60 in per_node_error_log (backend=backend@entry=0x690998, node_id=node_id@entry=0,
    query=query@entry=0x6a4190 "CREATE TABLE autocommit (a int);",
    prefix=prefix@entry=0x4e3220 "pool_send_and_wait: Error or notice message from backend: ", unread=unread@entry=1 '\001')
    at protocol/pool_proto_modules.c:3141
#8 0x0048a65e in pool_send_and_wait (query_context=query_context@entry=0x6a33e0, send_type=send_type@entry=1, node_id=0)
    at context/pool_query_context.c:830
#9 0x0047d8c4 in SimpleQuery (frontend=frontend@entry=0x69a178, backend=backend@entry=0x690998, len=33,
    contents=contents@entry=0x6a13d8 "CREATE TABLE autocommit (a int);") at protocol/pool_proto_modules.c:672
#10 0x004814d6 in ProcessFrontendResponse (frontend=frontend@entry=0x69a178, backend=backend@entry=0x690998)
    at protocol/pool_proto_modules.c:2369
---Type <return> to continue, or q <return> to quit---
#11 0x00477874 in read_packets_and_process (frontend=frontend@entry=0x69a178, backend=backend@entry=0x690998,
    reset_request=reset_request@entry=0, state=state@entry=0x7e975098, num_fields=num_fields@entry=0x7e975096,
    cont=cont@entry=0x7e9750a0 "\001") at protocol/pool_process_query.c:4755
#12 0x00477f9e in pool_process_query (frontend=0x69a178, backend=0x690998, reset_request=reset_request@entry=0)
    at protocol/pool_process_query.c:273
#13 0x00473c1a in do_child (fds=fds@entry=0x694b90) at protocol/child.c:383
#14 0x0045b494 in fork_a_child (fds=0x694b90, id=29) at main/pgpool_main.c:611
#15 0x0045ff84 in PgpoolMain (discard_status=<optimized out>, clear_memcache_oidmaps=<optimized out>) at main/pgpool_main.c:367
#16 0x00459ebc in main (argc=<optimized out>, argv=0x7e980664) at main/main.c:318

Old buffer content, is just the target, so content should not matter
(gdb) p buf
$1 = 0x676de8 "pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 2827 statement: \"CREATE TABLE autocommit (a int);\" message: \"BLE IF EXISTS utocommit;\" message: \"table \"utocommit"...

BUffer size 1024.
(gdb) p len
$2 = 1024

There is a guard on buf+1024 so it seems right.

(gdb) p fmt
$3 = 0x6769e0 "%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\""

5 elements to format.
Maybe some in the valist are broken?

va_list parsing is hard, but we should have the fill stack.

This is from
per_node_error_log in src/protocol/pool_proto_modules.c
But re-mangled in errmsg and macro around it.

3139 »···if (pool_extract_error_message(true, CONNECTION(backend, node_id), MAJOR(backend), true, &message) > 0)
3140 »···{
3141 »···»···ereport(LOG,
3142 »···»···»···(errmsg("%s: DB node id: %d backend pid: %d statement: \"%s\" message: \"%s\"",
3143 »···»···»···»··· prefix, node_id, ntohl(slot->pid), query, message)));
3144 »···}

Frame 7:
It seems that message is a bit odd.
(gdb) p prefix
$24 = 0x4e3220 "pool_send_and_wait: Error or notice message from backend: "
(gdb) p node_id
$25 = 0
(gdb) p slot->pid
$26 = 185270272
(gdb) p query
$27 = 0x6a4190 "CREATE TABLE autocommit (a int);"
(gdb) p message
$28 = 0x25000000 <error: Cannot access memory at address 0x25000000>

That is a local variable.
And should have been set by the "pool_extract_error_message" call.

On Frame 6 va_start was called on args, but the content is the same.
As the 250000000 seems wrong we need to check pool_extract_error_message on this.

It runs into this code block failing to get the message as it wants:
4366 if (kind != 'E' && kind != 'N')
4367 {
4368 pool_unread(backend, str_buf->data, readlen);
4369 »···»···»···»···pfree(str_message_buf->data);
4370 »···»···»···»···pfree(str_message_buf);
4371 »···»···»···»···pfree(str_buf->data);
4372 »···»···»···»···pfree(str_buf);
4373 ereport(ERROR,
4374 (errmsg("unable to extract error message"),
4375 errdetail("invalid message kind \"%c\"",kind)));
4376 }

kind is "'C' in my case.
That triggers PG_CATCH and exits the function with
(gdb) p *message
$24 = 0x25000000 <error: Cannot access memory at address 0x25000000>

The outer code should not use the message in this case:
pool_extract_error_message sets ret = -1
And the outer code has
> 0 to pass

But this might be the arch reasons here only hitting armhf.
On fin to the pool_extract_error_message function I see 255.

(gdb) fin
Run till exit from #0 pool_extract_error_message (read_kind=read_kind@entry=1 '\001', backend=backend@entry=0x130e9a0, major=3,
    unread=unread@entry=1 '\001', message=message@entry=0x7ee61928) at protocol/pool_process_query.c:4439
0x004e3f24 in per_node_error_log (backend=backend@entry=0x130d998, node_id=node_id@entry=0,
    query=query@entry=0x1321190 "CREATE TABLE autocommit (a int);",
    prefix=prefix@entry=0x547220 "pool_send_and_wait: Error or notice message from backend: ", unread=unread@entry=1 '\001')
    at protocol/pool_proto_modules.c:3139
3139 if (pool_extract_error_message(true, CONNECTION(backend, node_id), MAJOR(backend), true, &message) > 0)
Value returned is $26 = 255
(gdb) p *message
Cannot access memory at address 0x25000000
(gdb) n
3141 ereport(LOG,

Per Definition of the function retvals should be like:
4337 * Return values are:
4338 * 0: not error or notice message
4339 * 1: succeeded to extract error message
4340 * -1: error)

Other calls already either ignore the retval or check for == 1
3 pool_auth.c pool_do_auth 298 if (pool_extract_error_message(false, MASTER(cp), protoMajor, true,
                                                       &message) == 1)
4 pool_auth.c pool_do_auth 315 if (pool_extract_error_message(false, MASTER(cp), protoMajor, true,
                                                       &message) == 1)
5 pool_auth.c do_clear_text_password 609 if (pool_extract_error_message(false, backend, protoMajor, false, &message)
                                                       == 1)
6 pool_auth.c do_crypt 756 if (pool_extract_error_message(false, backend, protoMajor, false, &message)
                                                       == 1)
7 pool_auth.c send_password_packet 1046 if (pool_extract_error_message(false, backend, protoMajor, false, &message)
                                                       == 1)
8 pool_auth.c pool_read_kind 1220 if (pool_extract_error_message(false, MASTER(cp), MAJOR(cp), true,
                                                       &message) == 1)
3 pool_process_query.c read_kind_from_backend 3567 if (pool_extract_error_message(false, CONNECTION(backend, i),
                                                       MAJOR(backend), true, &m) == 1)
5 pool_proto_modules.c per_node_error_log 3139 if (pool_extract_error_message(true, CONNECTION(backend, node_id),
                                                       MAJOR(backend), true, &message) == 1)

This would avoid the issue here as well.
Only go on if == 1 which means found a message to report.

Related branches

I'll test the proposed fix and then report it upstream (if ok).

Not showing in debci as this tests onlx x86 on pgpool2

- repro in cosmic armhf LXD container on raspi
Or VM
- qemu-system-arm -kernel vmlinuz -initrd initrd.gz -append "root=/dev/ram" -no-reboot -nographic -m 1024 -M virt -nodefaults -serial stdio -hda ubuntu.img -netdev user,id=network0 -device virtio-net,netdev=network0,mac=52:54:00:12:34:56
Then run:
  apt install postgresql-common postgresql default-jdk postgresql-server-dev-all libpostgresql-jdbc-java ubuntu-dev-tools pgpool2 libpgpool0 libpgpool-dev postgresql-10-pgpool2
  pull-lp-source pgpool2
  cd pgpool2-3.7.3
  ./debian/tests/jdbc-tests

Patch submitted upstream, we might take Delta for it for a while but want to hear upstreams opinion on it.

=> https://github.com/pgpool/pgpool2/issues/14

Upstream acked the patch, will be in git soon and then we can pull it into Cosmic.
Later upstream releases can then sync over it.

Changed in pgpool2 (Ubuntu):
status: New → Triaged
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pgpool2 - 3.7.3-1ubuntu1

---------------
pgpool2 (3.7.3-1ubuntu1) cosmic; urgency=medium

  * d/p/lp1777418-fix-arm-crash.patch: Fix crash on arm due to different
    retval handling triggering a segfault (LP: #1777418).

pgpool2 (3.7.3-1) unstable; urgency=medium

  * New upstream version.

pgpool2 (3.7.2-1) unstable; urgency=medium

  * New upstream version.
  * Use SIGINT to stop service. Suggested by Gunnar Bluth, thanks!
    (Closes: #828769)
  * Move packaging repository to salsa.debian.org

pgpool2 (3.6.7-1) unstable; urgency=medium

  * New upstream version.
    + Supports OpenSSL 1.1. (Closes: #859552)
  * Remove pgpool2-dbg package.
  * Fix path of pg_md5 and pcp_recovery_node in scripts. (Closes: #877851)

pgpool2 (3.6.6-2) unstable; urgency=medium

  * Remove rpath from pgpool_adm.so.

pgpool2 (3.6.6-1) unstable; urgency=medium

  * Team upload for PostgreSQL 10 support.
  * New upstream version.

 -- Christian Ehrhardt <email address hidden> Mon, 18 Jun 2018 12:18:37 +0200

Changed in pgpool2 (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.