pmproxy crash at startup in libpcp_web.so.1

Bug #2060275 reported by Martin Pitt
This bug affects 1 person
Affects       Status  Importance  Assigned to  Milestone
pcp (Ubuntu)  New     Undecided   Unassigned

Bug Description

In Cockpit's CI we see a lot of pmproxy crashes like [1] in a test which starts/stops/reconfigures pmlogger, pmproxy, and redis. The journal (some examples are [2][3][4]) always shows a similar stack trace:

pmproxy[9832]: segfault at 3 ip 0000767961047e45 sp 00007ffe97e825d0 error 4 in libpcp_web.so.1[767961018000+5c000] likely on CPU 0 (core 0, socket 0)

Stack trace of thread 9832:
#0 0x0000767961047e45 n/a (libpcp_web.so.1 + 0x38e45)
#1 0x0000767961059745 n/a (libpcp_web.so.1 + 0x4a745)
#2 0x0000767961056311 n/a (libpcp_web.so.1 + 0x47311)
#3 0x0000767960f5c52b n/a (libuv.so.1 + 0x2752b)
#4 0x0000767960f5dbdb n/a (libuv.so.1 + 0x28bdb)
#5 0x0000767960f44ce8 uv_run (libuv.so.1 + 0xfce8)
#6 0x00005cae24f55097 n/a (pmproxy + 0xb097)
#7 0x00005cae24f53b6d n/a (pmproxy + 0x9b6d)
#8 0x000076796062a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
#9 0x000076796062a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
#10 0x00005cae24f54135 n/a (pmproxy + 0xa135)
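
For reference, the kernel line above can be decoded mechanically. This is only a rough Python sketch: the regex and the `decode_segfault` helper are made up for illustration and are not part of PCP or systemd. The bit meanings are the usual x86 page-fault error code bits, and the assumption that the bracketed range is the memory mapping containing `ip` (rather than the library's load base) is mine.

```python
import re

# Kernel log line copied from the journal excerpt in this report.
LINE = ("pmproxy[9832]: segfault at 3 ip 0000767961047e45 sp 00007ffe97e825d0 "
        "error 4 in libpcp_web.so.1[767961018000+5c000]")

# All numeric fields, including the error code, are parsed as hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[]+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Split a kernel 'segfault at ...' line into its fields (hypothetical helper)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    start = int(m["start"], 16)
    return {
        "object": m["obj"],
        "fault_address": int(m["addr"], 16),
        "ip": ip,
        "offset_in_mapping": ip - start,
        "page_present": bool(err & 1),   # bit 0: protection fault vs. page not present
        "write": bool(err & 2),          # bit 1: write vs. read access
        "user_mode": bool(err & 4),      # bit 2: fault happened in user mode
    }

info = decode_segfault(LINE)

# Frame #0 of the stack trace reports "libpcp_web.so.1 + 0x38e45";
# subtracting that offset from ip gives the load base of the library.
FRAME0_OFFSET = 0x38e45
load_base = info["ip"] - FRAME0_OFFSET

print(info)
print("load base:", hex(load_base))
```

Read this way, "error 4" at address 3 is a user-mode read from an unmapped page very close to NULL, which would fit a NULL or near-NULL pointer dereference inside libpcp_web.so.1. That is an interpretation of the log line, not something confirmed by symbols.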

Unfortunately that's not very useful, as the libpcp_web.so.1 and pmproxy frames have no symbols. I managed to reproduce it once locally and got a core dump (attached), but running it through gdb isn't enlightening either. It does spend several minutes downloading debug symbols, but apparently not the right ones?

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.

Downloading separate debug info for /lib/libpcp_web.so.1
[... lots more ...]

(gdb) bt
#0 0x00007b1d588cbe45 in ?? () from /lib/libpcp_web.so.1
#1 0x00007b1d588dd745 in ?? () from /lib/libpcp_web.so.1
#2 0x00007b1d588da311 in ?? () from /lib/libpcp_web.so.1
#3 0x00007b1d587e052b in uv__inotify_read (loop=0x7b1d587ed180 <default_loop_struct>, dummy=<optimized out>, events=1)
    at /usr/src/libuv1-1.48.0-1/src/unix/linux.c:2466
#4 0x00007b1d587e1bdb in uv__io_poll (loop=0x7b1d587ed180 <default_loop_struct>, timeout=<optimized out>)
    at /usr/src/libuv1-1.48.0-1/src/unix/linux.c:1528
#5 0x00007b1d587c8ce8 in uv_run (loop=0x7b1d587ed180 <default_loop_struct>, mode=UV_RUN_DEFAULT) at /usr/src/libuv1-1.48.0-1/src/unix/core.c:448
#6 0x00005b98349dd097 in ?? ()
#7 0x00005b98349dbb6d in ?? ()
#8 0x00007b1d57e2a1ca in __libc_start_call_main (main=main@entry=0x5b98349db610, argc=argc@entry=3, argv=argv@entry=0x7ffc673aeac8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#9 0x00007b1d57e2a28b in __libc_start_main_impl (main=0x5b98349db610, argc=3, argv=0x7ffc673aeac8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffc673aeab8) at ../csu/libc-start.c:360
#10 0x00005b98349dc135 in ?? ()
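
Even without symbols, the gdb backtrace can be matched against the journal stack trace by module offset. A small Python sketch, assuming that frame #0 of both crashes sits at the same offset in libpcp_web.so.1; the assumption is checked by requiring a page-aligned load base and matching offsets for frames #1 and #2. The names `JOURNAL`, `GDB` and `load_base` are made up for illustration.

```python
# (absolute address, module offset) for frames #0-#2 of the journal stack trace.
JOURNAL = [
    (0x0000767961047e45, 0x38e45),
    (0x0000767961059745, 0x4a745),
    (0x0000767961056311, 0x47311),
]

# Absolute addresses of frames #0-#2 in the gdb backtrace of the local core dump.
GDB = [0x00007b1d588cbe45, 0x00007b1d588dd745, 0x00007b1d588da311]

def load_base(frames):
    """Return the single load base implied by (address, offset) pairs."""
    bases = {addr - off for addr, off in frames}
    if len(bases) != 1:
        raise ValueError("frames disagree about the load base")
    return bases.pop()

journal_base = load_base(JOURNAL)
journal_offsets = [off for _, off in JOURNAL]

# Assumption: frame #0 of the local crash is at the same offset as in the journal.
gdb_base = GDB[0] - journal_offsets[0]
gdb_offsets = [addr - gdb_base for addr in GDB]

# The assumption is plausible only if the derived base is page-aligned
# and the remaining frames land on the same offsets.
same_site = gdb_base % 0x1000 == 0 and gdb_offsets == journal_offsets

print(hex(journal_base), hex(gdb_base), same_site)
```

With the addresses from this report both checks hold, so the local core dump appears to show the same crash site as the CI journals.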

So I followed the "good old dbgsym" way [5], but:

E: Unable to locate package libpcp-web1-dbgsym
E: Unable to locate package libpcp3-dbgsym
E: Unable to locate package pcp-dbgsym

The build log [6] also doesn't mention any dbgsym builds, so it seems they are missing?

[1] https://cockpit-logs.us-east-1.linodeobjects.com/pull-20264-13fcc041-20240404-201827-ubuntu-stable-other/log.html#34
[2] https://cockpit-logs.us-east-1.linodeobjects.com/pull-20264-13fcc041-20240404-201827-ubuntu-stable-other/TestHistoryMetrics-testPmProxySettings-ubuntu-stable-127.0.0.2-2201-FAIL.log.gz
[3] https://cockpit-logs.us-east-1.linodeobjects.com/pull-6177-6626b317-20240404-225904-ubuntu-stable-other-cockpit-project-cockpit/TestHistoryMetrics-testPmProxySettings-ubuntu-stable-127.0.0.2-2401-FAIL.log.gz
[4] https://cockpit-logs.us-east-1.linodeobjects.com/pull-20261-d1621935-20240404-105717-ubuntu-stable-other/TestHistoryMetrics-testPmProxySettings-ubuntu-stable-127.0.0.2-2201-FAIL.log.gz
[5] https://wiki.ubuntu.com/DebuggingProgramCrash
[6] https://launchpadlibrarian.net/714485247/buildlog_ubuntu-noble-amd64.pcp_6.2.0-1_BUILDING.txt.gz

Ubuntu 24.04
pcp 6.2.0-1

Tags: noble crash
Martin Pitt (pitti) wrote :

Sorry, clicked the wrong button, I'll expand the bug description. In the meantime, attaching the core dump.

description: updated
tags: added: crash noble
Martin Pitt (pitti) wrote :

Maybe the missing dbgsym packages are on purpose? The build log has this:

# Note: --no-automatic-dbgsym not defined for all releases up to
# and including Debian 8 (jessie), but defined after that
# ... expect a warning on older releases, but no other ill
# effects from the unknown option ... until dh_strip started
# aborting on Ubuntu 14.04 (vm00) on 23 Nov 2017
if dh_strip -a --no-automatic-dbgsym; then :; else dh_strip -a; fi

Martin Pitt (pitti) wrote : Re: Fwd: [Bug 2060275] [NEW] pmproxy crash at startup in libpcp_web.so.1

Hello Nathan,

Nathan Scott [2024-04-09 16:19 +1000]:
> Is any of this getting through... ? Just checked the Ubuntu tracker
> URL, and looks like every response Ken or I sent has been dropped on
> the ground.

Right, I didn't get any response either (not a surprise, as Launchpad receives
the replies *first* and only then sends out notifications). I did do bug
replies via email in the past, but these days they may get caught in some spam
prevention measures? Probably better to just post them on the web UI?

I'm CCing my reply to the LP bug if you don't mind -- first of all to test this,
and also to keep a more permanent record of our discussion.

> Long and short of it is, we've not been able to reproduce and debian
> dbgsym on sub-packages is still broken for unknown reasons...
> https://github.com/performancecopilot/pcp/pull/1948

It's not really unknown, it's "just" a file conflict:

| dpkg: error processing archive build/deb/pcp-pmda-infiniband-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
| trying to overwrite '/usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
|
| dpkg: error processing archive build/deb/pcp-testsuite-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
| trying to overwrite '/usr/lib/debug/.build-id/17/6edc7e590f766a2ea87b5decaeb994d7c48d24.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285

I.e. these are shipped in two different packages.

[1] https://github.com/performancecopilot/pcp/actions/runs/8610492722/job/23596103839?pr=1948#step:9:149

> This is not a known bug - do you know if this is specific to pcp-6.2.0
> (latest PCP) or are earlier versions affected? One change that may
> be related here is we enabled v3 PCP archives by default in 6.2.0.

We see this only in noble (i.e. upcoming 24.04). I.e. 6.0.5 in 23.10 was still
ok, and 6.2.0 occasionally crashes.

> The limited stack we have suggests we're in pmproxy log discovery
> code, in an inotify/libuv event, which does have v3-specific code.
>
> For those who can reproduce this, it'd be worth experimenting and
> setting the following field back to 2 ... (requires pmlogger restart).
>
> $ grep PCP_ARCHIVE_VERSION /etc/pcp.conf
> PCP_ARCHIVE_VERSION=3
>
> If that clears the issue, it'll help us triangulate on a possible cause.

OK -- I'll do some experimentation and report back here.

Thanks!

Martin

Martin Pitt (pitti) wrote :

Nathan Scott [2024-04-09 17:30 +1000]:
> > It's not really unknown, it's "just" a file conflict:
>
> Yeah - the unknown bit for me is "why tho" - I cannot see conflicting
> files in those packages that would have any debug symbols (there's
> some common directories... but no binaries shared AFAICS).
>
> > | dpkg: error processing archive build/deb/pcp-pmda-infiniband-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
> > | trying to overwrite '/usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
> > |
> > | dpkg: error processing archive build/deb/pcp-testsuite-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
> > | trying to overwrite '/usr/lib/debug/.build-id/17/6edc7e590f766a2ea87b5decaeb994d7c48d24.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
> >
> > I.e. these are shipped in two different packages.
>
> "these"?

These two files, i.e.
/usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug exists
both in pcp-pmda-infiniband-dbgsym and pcp-dbgsym. Presumably they shouldn't be
in the latter.
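
As a side note on the "why": the file name under /usr/lib/debug/.build-id/ is normally derived from the GNU build ID of the ELF object the symbols were stripped from, so a conflict on that path suggests that an ELF object with the same build ID is shipped in both binary packages (for example the same binary installed twice), even if the installed file names differ. That is a guess, not something verified against the pcp packaging. A small Python sketch that pulls the package names and build IDs out of the dpkg output quoted above; the regexes and the `conflicts` helper are made up for illustration.

```python
import re

# dpkg output as quoted in this thread (quote markers removed).
DPKG_OUTPUT = """\
dpkg: error processing archive build/deb/pcp-pmda-infiniband-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
 trying to overwrite '/usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
dpkg: error processing archive build/deb/pcp-testsuite-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
 trying to overwrite '/usr/lib/debug/.build-id/17/6edc7e590f766a2ea87b5decaeb994d7c48d24.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
"""

ARCHIVE = re.compile(r"error processing archive (?:\S*/)?(?P<pkg>[^_/\s]+)_\S+\.deb")
OVERWRITE = re.compile(
    r"trying to overwrite '(?P<path>[^']+)', which is also in package (?P<owner>\S+)"
)
BUILD_ID = re.compile(r"/\.build-id/(?P<hi>[0-9a-f]{2})/(?P<lo>[0-9a-f]+)\.debug$")

def conflicts(text):
    """Return (build_id, package_being_installed, package_already_owning_file) tuples."""
    current = None
    found = []
    for line in text.splitlines():
        archive = ARCHIVE.search(line)
        if archive:
            current = archive["pkg"]
            continue
        overwrite = OVERWRITE.search(line)
        if overwrite and current:
            bid = BUILD_ID.search(overwrite["path"])
            # The build ID is the two-character directory name plus the file stem.
            build_id = bid["hi"] + bid["lo"] if bid else None
            found.append((build_id, current, overwrite["owner"]))
    return found

for build_id, pkg, owner in conflicts(DPKG_OUTPUT):
    print(build_id, pkg, owner)
```

The resulting build IDs could then be compared against the output of `readelf -n` on the binaries of the pcp, pcp-pmda-infiniband and pcp-testsuite packages to find which ELF objects carry them.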

(I've been out of this for many years, so I'm afraid I don't know what a good
solution is, i.e. how much control you have over dbgsym generation).

> OK ... so that's pointing towards v3 archives a little bit, good.
>
> > > The limited stack we have suggests we're in pmproxy log discovery
> > > code, in an inotify/libuv event, which does have v3-specific code.
> > >
> > > For those who can reproduce this, it'd be worth experimenting and
> > > setting the following field back to 2 ... (requires pmlogger restart).
> > >
> > > $ grep PCP_ARCHIVE_VERSION /etc/pcp.conf
> > > PCP_ARCHIVE_VERSION=3

I created https://github.com/cockpit-project/cockpit/pull/20275 with an x120
test amplification, and interestingly the overwhelming majority of test runs
actually crash there. So with that I have fairly high confidence in the
significance of test results when trying a change.

I tested with

  sed -i 's/PCP_ARCHIVE_VERSION=3/PCP_ARCHIVE_VERSION=2/' /etc/pcp.conf

This runs on image preparation, i.e. clean /var/log and no daemons running. The
VM is freshly booted for each test, so no running pmlogger. There is no
observed change, it still crashes the same way and with the same frequency
("almost every time").
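
One caveat with the sed call: it silently does nothing if the line in /etc/pcp.conf is not exactly `PCP_ARCHIVE_VERSION=3`. A minimal Python sketch of the same substitution that fails loudly instead; the `set_archive_version` helper and the sample text are made up for illustration, only the `PCP_ARCHIVE_VERSION` key comes from this thread.

```python
import re

def set_archive_version(conf_text, version):
    """Return conf_text with PCP_ARCHIVE_VERSION set to `version`.

    Raises LookupError if the key is not present, instead of silently
    leaving the text unchanged like a non-matching sed expression would.
    """
    pattern = re.compile(r"^PCP_ARCHIVE_VERSION=.*$", re.MULTILINE)
    new_text, count = pattern.subn(f"PCP_ARCHIVE_VERSION={version}", conf_text)
    if count == 0:
        raise LookupError("PCP_ARCHIVE_VERSION not found; nothing was changed")
    return new_text

# Made-up sample text, not the contents of a real /etc/pcp.conf.
sample = "OTHER_SETTING=1\nPCP_ARCHIVE_VERSION=3\n"
print(set_archive_version(sample, 2))
```

Running `grep PCP_ARCHIVE_VERSION /etc/pcp.conf` after the sed call, as in the quoted suggestion, gives the same confirmation.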

Note that I can easily pull in a PPA or even a binary with curl for testing.

Nathan Scott (nathans) wrote :

Hi Martin,

On Tue, Apr 9, 2024 at 6:09 PM Martin Pitt <email address hidden> wrote:
>
> Nathan Scott [2024-04-09 17:30 +1000]:
> > > It's not really unknown, it's "just" a file conflict:
> >
> > Yeah - the unknown bit for me is "why tho" - I cannot see conflicting
> > files in those packages that would have any debug symbols (there's
> > some common directories... but no binaries shared AFAICS).
> >
> > > | dpkg: error processing archive build/deb/pcp-pmda-infiniband-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
> > > | trying to overwrite '/usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
> > > |
> > > | dpkg: error processing archive build/deb/pcp-testsuite-dbgsym_6.2.1-0.20240409.f312285_amd64.deb (--install):
> > > | trying to overwrite '/usr/lib/debug/.build-id/17/6edc7e590f766a2ea87b5decaeb994d7c48d24.debug', which is also in package pcp-dbgsym 6.2.1-0.20240409.f312285
> > >
> > > I.e. these are shipped in two different packages.
> >
> > "these"?
>
> These two files, i.e.
> /usr/lib/debug/.build-id/57/02df011cfaf166b948e1fefde236eaf3a6ee65.debug exists
> both in pcp-pmda-infiniband-dbgsym and pcp-dbgsym. Presumably they shouldn't be
> in the latter.

Yep, understood - but again, I'm not understanding why. AFAICS, there
are no files with the same names or contents between those packages.

> > OK ... so that's pointing towards v3 archives a little bit, good.
> >
> > > > The limited stack we have suggests we're in pmproxy log discovery
> > > > code, in an inotify/libuv event, which does have v3-specific code.
> > > >
> > > > For those who can reproduce this, it'd be worth experimenting and
> > > > setting the following field back to 2 ... (requires pmlogger restart).
> > > >
> > > > $ grep PCP_ARCHIVE_VERSION /etc/pcp.conf
> > > > PCP_ARCHIVE_VERSION=3
>
> I created https://github.com/cockpit-project/cockpit/pull/20275 with an x120
> test amplification, and interestingly the overwhelming majority of test runs
> actually crash there. So with that I have fairly high confidence in the
> significance of test results when trying a change.
>
> I tested with
>
> sed -i 's/PCP_ARCHIVE_VERSION=3/PCP_ARCHIVE_VERSION=2/' /etc/pcp.conf
>
> This runs on image preparation, i.e. clean /var/log and no daemons running. The
> VM is freshly booted for each test, so no running pmlogger. There is no
> observed change, it still crashes the same way and with the same frequency
> ("almost every time").

OK, so it's not related to the recent v3 archive changes then.

I have a fairly recent Debian VM locally - I tried reproducing the problem
there but had no luck, it always starts fine and runs fine. Tried running it
under valgrind too just in case, but again nothing.

We also don't see this issue on Fedora, CentOS or RHEL, and SuSE is also not
reporting it. Since (I think) all Debian versions are fine (can we
confirm?) it might be an Ubuntu-specific issue. Are there patches or some
global compiler option(s) unique to Ubuntu versions where this is failing?
Otherwise, I'm fresh out of ideas.

I can't really justify time chasing this any further - I think we...


Martin Pitt (pitti) wrote :

There are no patches, it's a straight import of the source package into Ubuntu. Ubuntu *does* have different compiler options than Debian, so that may be a factor. Otherwise I'm in the same boat as you -- there's only so much time I can throw at this (I've spent the last two weeks full-time investigating, reporting, and trying to reproduce regressions in various OSes).

It would certainly be good if someone from Debian or Ubuntu could figure out the debug symbol building, though. Without that, it's too hard to figure out this crash.
