PMDK FTBFS on ppc64el obj_basic_integration/TEST5 crashed

Bug #2061913 reported by Bryce Harrington
This bug affects 2 people
Affects                            Status        Importance  Assigned to  Milestone
The Ubuntu-power-systems project   In Progress   Undecided   bugproxy
pmdk                               New           Unknown
pmdk (Debian)                      Confirmed     Unknown
pmdk (Ubuntu)                      Fix Released  Undecided   Unassigned
  Noble                            Fix Released  Undecided   Unassigned
  Oracular                         Fix Released  Undecided   Unassigned

Bug Description

Affects ppc64el (only):

https://launchpadlibrarian.net/724116691/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build1_BUILDING.txt.gz
https://launchpadlibrarian.net/724821331/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build2_BUILDING.txt.gz

The same failure also appears to affect Debian on the same architecture:

https://buildd.debian.org/status/fetch.php?pkg=pmdk&arch=ppc64el&ver=1.13.1-1.1%2Bb1&stamp=1708597682&raw=0
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064559

obj_basic_integration/TEST5 crashed (signal 4). err5.log below.
{ut_backtrace.c:175 ut_sighandler} obj_basic_integration/TEST5:

{ut_backtrace.c:176 ut_sighandler} obj_basic_integration/TEST5: Signal 4, backtrace:
{ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 0: ./obj_basic_integration(+0xc9f8) [0x18c9f8]
{ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 1: ./obj_basic_integration(+0xcb8c) [0x18cb8c]
{ut_backtrace.c:178 ut_sighandler} obj_basic_integration/TEST5:

err5.log below.
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:175 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:176 ut_sighandler} obj_basic_integration/TEST5: Signal 4, backtrace:
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 0: ./obj_basic_integration(+0xc9f8) [0x18c9f8]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 1: ./obj_basic_integration(+0xcb8c) [0x18cb8c]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:178 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log

Last 30 lines of memcheck5.log below (whole file has 48 lines).
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4915EB7: util_pool_create_uuids (set.c:2521)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x49160FB: util_pool_create (set.c:2563)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941183: pmemobj_createU (obj.c:1164)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941643: pmemobj_create (obj.c:1244)
obj_basic_integration/TEST5 memcheck5.log ==89952== Your program just tried to execute an instruction that Valgrind
obj_basic_integration/TEST5 memcheck5.log ==89952== did not recognise. There are two possible reasons for this.
obj_basic_integration/TEST5 memcheck5.log ==89952== 1. Your program has a bug and erroneously jumped to a non-code
obj_basic_integration/TEST5 memcheck5.log ==89952== location. If you are running Memcheck and you just saw a
obj_basic_integration/TEST5 memcheck5.log ==89952== warning about a bad jump, it's probably your program's fault.
obj_basic_integration/TEST5 memcheck5.log ==89952== 2. The instruction is legitimate but Valgrind doesn't handle it,
obj_basic_integration/TEST5 memcheck5.log ==89952== i.e. it's Valgrind's fault. If you think this is the case or
obj_basic_integration/TEST5 memcheck5.log ==89952== you are not sure, please let us know and we'll try to fix it.
obj_basic_integration/TEST5 memcheck5.log ==89952== Either way, Valgrind will now raise a SIGILL signal which will
obj_basic_integration/TEST5 memcheck5.log ==89952== probably kill your program.
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== HEAP SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== in use at exit: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== total heap usage: 193 allocs, 154 frees, 433,659 bytes allocated
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== LEAK SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== definitely lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== indirectly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== possibly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== still reachable: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== suppressed: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== Reachable blocks (those to which a pointer was found) are not shown.
obj_basic_integration/TEST5 memcheck5.log ==89952== To see them, rerun with: --leak-check=full --show-leak-kinds=all
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== For lists of detected and suppressed errors, rerun with: -s
obj_basic_integration/TEST5 memcheck5.log ==89952== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There are also some instances of valgrind crashes:

pmempool_feature/TEST4: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1396902 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck4.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature4😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep4.log
pmempool_feature/TEST4 crashed (signal 4).
grep4.log below.

RUNTESTS: stopping: pmempool_feature/TEST4 failed, TEST=check FS=any BUILD=debug
pmempool_feature/TEST5: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1397154 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck5.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature5😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep5.log
pmempool_feature/TEST5 crashed (signal 4).
grep5.log below.
pmempool_feature/TEST5 grep5.log query SHUTDOWN_STATE result is 1

1

Last 30 lines of memcheck5.log below (whole file has 65 lines).
pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address 0x4B59240
pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush (init.c:53)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B519C7: pmem_flush (pmem.c:229)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B51A6B: pmem_persist (pmem.c:240)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CA93: util_persist (util_pmem.h:27)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CBA7: util_persist_auto (util_pmem.h:40)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492DDC3: set_hdr (feature.c:256)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E143: feature_set (feature.c:325)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E967: disable_shutdown_state (feature.c:500)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492EF2F: pmempool_feature_disableU (feature.c:662)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492F1AB: pmempool_feature_disable (feature.c:738)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: feature_perform (feature.c:110)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: pmempool_feature_func (feature.c:206)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x18A45B: main (pmempool.c:271)
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== HEAP SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== in use at exit: 52,839 bytes in 21 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== total heap usage: 64 allocs, 43 frees, 108,953 bytes allocated
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== LEAK SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== definitely lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== indirectly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== possibly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== still reachable: 50,479 bytes in 16 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== suppressed: 2,360 bytes in 5 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== Reachable blocks (those to which a pointer was found) are not shown.
pmempool_feature/TEST5 memcheck5.log ==1397154== To see them, rerun with: --leak-check=full --show-leak-kinds=all
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== For lists of detected and suppressed errors, rerun with: -s
pmempool_feature/TEST5 memcheck5.log ==1397154== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Changed in pmdk:
status: Unknown → New
Changed in pmdk (Debian):
status: Unknown → New
Athos Ribeiro (athos-ribeiro) wrote :

Before the latest delta, I see:

20 tests failed:
obj_basic_integration/TEST5
obj_basic_integration/TEST6
obj_action/TEST1
obj_ctl_arenas/TEST6
obj_ctl_debug/TEST1
obj_locks/TEST1
obj_locks/TEST2
obj_mem/TEST1
obj_memcheck_register/TEST0
obj_pmalloc_mt/TEST2
obj_tx_alloc_mt/TEST2
obj_tx_locks/TEST1
obj_tx_locks/TEST2
obj_tx_locks_abort/TEST1
obj_tx_locks_abort/TEST2
out_err_mt/TEST1
out_err_mt/TEST2
pmempool_create/TEST7
pmempool_feature/TEST4
pmempool_feature/TEST5

After the delta:

19 tests failed:
obj_basic_integration/TEST6
obj_action/TEST1
obj_ctl_arenas/TEST6
obj_ctl_debug/TEST1
obj_locks/TEST1
obj_locks/TEST2
obj_mem/TEST1
obj_memcheck_register/TEST0
obj_pmalloc_mt/TEST2
obj_tx_alloc_mt/TEST2
obj_tx_locks/TEST1
obj_tx_locks/TEST2
obj_tx_locks_abort/TEST1
obj_tx_locks_abort/TEST2
out_err_mt/TEST1
out_err_mt/TEST2
pmempool_create/TEST7
pmempool_feature/TEST4
pmempool_feature/TEST5

I suppose that if we are skipping tests, we want to add all of these to the ppc64el skip list so we get a successful build here.

Bryce Harrington (bryce) wrote :

I also counted 19 failed tests (that we know of). I don't have a solid feeling whether these share the same root cause or stem from multiple underlying issues. It would not surprise me, for example, if there are ppc64el-specific issues in valgrind itself, in addition to separate and unrelated issues in pmdk. Also, if there is one specific opcode, for example, that causes all these failures, I have not had luck in pinpointing it, and it feels like it might require at least a deep dive or even manual debugging on a ppc64el host. I also don't think it is wise to assume Debian or upstream is prepared and ready to do that at the moment.

Given the uncertainty and the short timeframe, the server team discussed the options and determined that the best approach would be to bypass the tests on ppc64el to get a successful build, which will hopefully migrate and allow its rdepends to resolve.

We can't be certain whether these tests represent actual faults that will affect users, or are false positives or testsuite-specific issues that won't affect them. Just in case it's the former, this should be identified as a known issue in the release notes, a priority given to ascertain which is the case, and then the release notes updated and followup SRU bugs filed accordingly.

For now, we'll use this bug report for tracking purposes of the investigation of the test failures generally, but may wish to divide this into separate bug reports for more specific issues and use this as an umbrella bug report. I'll prioritize this as a "server-todo" bug for this work.

tags: added: server-todo
Bryce Harrington (bryce)
tags: added: update-excuse
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pmdk - 1.13.1-1.1ubuntu2

---------------
pmdk (1.13.1-1.1ubuntu2) noble; urgency=medium

  * Fix FTBFS issues in ppc64el:
    - d/rules: skip ppc64el build time tests
    - d/p/debian-changes: remove bogus file

 -- Athos Ribeiro <email address hidden> Thu, 18 Apr 2024 09:44:59 -0300

Changed in pmdk (Ubuntu):
status: New → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → bugproxy (bugproxy)
summary: - FTBFS ppc64el obj_basic_integration/TEST5 crashed
+ PMDK FTBFS on ppc64el obj_basic_integration/TEST5 crashed
Frank Heimes (fheimes) wrote :

I've just found this statement at the upstream github project and want to reference the info here:

https://github.com/pmem/pmdk
"
Experimental Support for PowerPC
There is initial support for ppc64le processors provided. It is currently not validated nor maintained. Thus, this architecture should not be used in a production environment.

The on-media pool layout is tightly attached to the page size of 64KiB used by default on ppc64le, so it is not interchangeable with different page sizes, includes those on other architectures. For more information on this port, contact Rajalakshmi Srinivasaraghavan (<email address hidden>) or Lucas Magalhães (<email address hidden>).
"

tags: added: reverse-proxy-bugzilla
Bryce Harrington (bryce) wrote :

Here's a short summary of current state for this issue.

In the final weeks of the noble release, an archive-wide rebuild identified a build failure in the pmdk package, due to issue(s) in the testsuite. This revealed itself as an architecture-specific problem that we suspect originated earlier in the release when a new libc was added (we didn't confirm this, but it's still our best guess). At least one instruction needed by valgrind was missing on ppc64el; see the upstream discussion at https://github.com/pmem/pmdk/issues/6079 for more details. Debian also sees similar failures in their CI.

One option would have been to change the package to not build on ppc64el. We opted to instead just bypass the testsuite, because a) we don't yet know if the issue flagged by the testsuite is going to surface as actual problems for pmdk users on this architecture in which case that could be overkill, and b) dropping the architecture might have required similar adjustments to other packages. This successfully allowed the package to migrate for the release.

However, this leaves some questions to follow up on: Is pmdk/ppc64el adequately functional on Ubuntu 24.04? Should we keep or drop the architecture for pmdk on Ubuntu 24.10 and going forward, given upstream's support limitations and uncertainties? If a fix becomes available, should we backport it to 24.04 and/or re-enable the testsuite there?

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-206197 severity-medium targetmilestone-inin---
Changed in pmdk (Ubuntu Noble):
status: New → Fix Released
Changed in pmdk (Ubuntu Oracular):
status: Fix Released → Triaged
milestone: none → ubuntu-24.10-beta
Bryce Harrington (bryce) wrote :

This was resolved in 1.13.1-1.1ubuntu2 by skipping the build time tests on ppc64el. The issue has been forwarded upstream and to Debian, but there are no updates at this time. The analysis from the last comment on this bug is still valid, but at present there is no further work to be done on our end.

Changed in pmdk (Ubuntu Oracular):
status: Triaged → Fix Released
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2024-10-07 20:50 EDT-------
(In reply to comment #3)
> Last 30 lines of memcheck5.log below (whole file has 65 lines).
> pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address
> 0x4B59240
> pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush
> (init.c:53)

Looking at the source at init.c:53, that is:

asm volatile(__DCBF(0, %0, 6) : :"r"(uptr) : "memory");

The dcbf instruction is a RA|0 instruction, meaning the base address is its RA operand and its value is either the contents of RA for RA = r1-r31 or the value zero if RA = 0 (regardless of the contents of r0). My guess is that the valgrind error you are seeing is because the base register is r0 and that doesn't make sense.

The reason you could get r0 as a base register here is because for RA|0 operands, you should never use the "r" register constraint, which tells the compiler you want any register between r0 - r31. You need to use the "b" constraint here which tells the compiler to give you a register between r1 - r31.

...so this is a user source error. The fix is:

--- a/src/libpmem2/ppc64/init.c
+++ b/src/libpmem2/ppc64/init.c
@@ -50,7 +50,7 @@ ppc_flush(const void *addr, size_t size)
 		 * According to the POWER ISA 3.1, dcbstps (aka. dcbf (L=6))
 		 * behaves as dcbf (L=0) on previous processors.
 		 */
-		asm volatile(__DCBF(0, %0, 6) : :"r"(uptr) : "memory");
+		asm volatile(__DCBF(0, %0, 6) : :"b"(uptr) : "memory");

 		uptr += CACHELINE_SIZE;
 	}
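
As a rough illustration of the RA|0 rule described in this comment, here is a minimal C model of how the effective address would be formed. It is only a sketch: `effective_address` and the `gpr` array are hypothetical names made up for this example and are not taken from PMDK or Valgrind.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the effective-address calculation for an instruction with
 * an "RA|0" operand, such as dcbf. gpr[] stands in for the 32
 * general-purpose registers; this is an illustration, not real PMDK code.
 */
uint64_t
effective_address(const uint64_t gpr[32], unsigned ra, unsigned rb)
{
	assert(ra < 32 && rb < 32);

	/* RA|0: register number 0 in the RA slot means the value 0, not r0. */
	uint64_t base = (ra == 0) ? 0 : gpr[ra];

	/* RB is an ordinary register operand, so r0 is read normally there. */
	return base + gpr[rb];
}
```

With `gpr[0] = 0x1000` and `gpr[10] = 0x40`, passing register number 0 as RA gives 0x40 rather than 0x1040. That silent loss of the base is why the "b" constraint (r1 - r31) is the safe choice whenever the compiler picks the register that ends up in an RA|0 slot.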

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-10-07 21:30 EDT-------
(In reply to comment #8)
> ...so this is a user source error. The fix is:
>
> --- a/src/libpmem2/ppc64/init.c
> +++ b/src/libpmem2/ppc64/init.c
> @@ -50,7 +50,7 @@ ppc_flush(const void *addr, size_t size)
> * According to the POWER ISA 3.1, dcbstps (aka. dcbf (L=6))
> * behaves as dcbf (L=0) on previous processors.
> */
> - asm volatile(__DCBF(0, %0, 6) : :"r"(uptr) : "memory");
> + asm volatile(__DCBF(0, %0, 6) : :"b"(uptr) : "memory");
>
> uptr += CACHELINE_SIZE;
> }

Actually, looking closer, this isn't what valgrind is complaining about. I'm looking deeper.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-10-07 22:21 EDT-------
(In reply to comment #9)
> Actually, looking closer, this isn't what valgrind is complaining about.
> I'm looking deeper.

The pmdk code change where this was introduced was:

- /* issue a dcbst instruction for the cache line */
- asm volatile(
- "dcbst 0,%0"
- : :"r"(uptr) : "memory");
+ /*
+ * Flush the data cache block.
+ * According to the POWER ISA 3.1, dcbstps (aka. dcbf (L=6))
+ * behaves as dcbf (L=0) on previous processors.
+ */
+ asm volatile(__DCBF(0, %0, 6) : :"r"(uptr) : "memory");

I think the problem here is that valgrind is being too smart and recognizing that the "dcbf r0,RB,6" version of the instruction is a Power10 version of dcbf where the extra L operand was added. When I execute a simple binary with "dcbf r0,RB,6" on a Power10 system (assuming RB points to some real memory), it executes fine and valgrind has no problem with it. If I take the same binary and execute it on a Power9 system, then it again executes fine, but valgrind flags the dcbf instruction as illegal.

If it is true that L=6 should act like L=0 on CPUs older than Power10, then valgrind shouldn't flag the instruction as illegal when run on those older CPUs. I'll verify with our hardware team that L=6 is OK on Power9 and earlier, and will talk with my valgrind developer about a fix if that is the case. If it is not true, then pmdk has a source bug. I'll report back what I find out.
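
To make the "extra L operand" concrete, the sketch below builds the 32-bit instruction word for dcbf from its fields. It assumes the X-form layout as I understand it from the Power ISA (primary opcode 31, extended opcode 86, and in ISA 3.1 a 3-bit L field directly above RA). The names `encode_dcbf` and `dcbf_l_field` are made up for this example, and register 9 is an arbitrary choice for RB.

```c
#include <assert.h>
#include <stdint.h>

/* Base word for dcbf: primary opcode 31 (<< 26) and extended opcode 86 (<< 1). */
#define DCBF_BASE 0x7c0000acu

/* Assemble a dcbf instruction word from its RA, RB and L fields. */
uint32_t
encode_dcbf(unsigned ra, unsigned rb, unsigned l)
{
	assert(ra < 32 && rb < 32 && l < 8);

	return DCBF_BASE |
	    ((uint32_t)l << 21) |	/* L field (3 bits in ISA 3.1) */
	    ((uint32_t)ra << 16) |	/* RA field (RA|0 operand) */
	    ((uint32_t)rb << 11);	/* RB field */
}

/* Extract the 3-bit L field, i.e. the bits a decoder would inspect. */
unsigned
dcbf_l_field(uint32_t insn)
{
	return (insn >> 21) & 0x7u;
}
```

For example, `encode_dcbf(0, 9, 6)` gives 0x7cc048ac, while the L=0 form `encode_dcbf(0, 9, 0)` gives 0x7c0048ac. The two words differ only in the L bits, which is the part of the instruction a decoder has to decide whether to accept on a pre-Power10 CPU.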

Frank Heimes (fheimes) wrote :

Many thanks Peter for your investigation so far!

Changed in ubuntu-power-systems:
status: New → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-10-23 19:31 EDT-------
CC'ing Carl, who is going to modify Valgrind to accept the L=4 and L=6 versions of dcbf on older CPUs.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-10-24 19:49 EDT-------
Valgrind has a check that the L=6 option is only valid for ISA 3.1. If L=6 on non ISA 3.1 systems, Valgrind reports an error.

Following the discussion with Peter, the dcbf instruction was tested on Power9 and Power10. On ISA 3.0 hardware (Power9), where the L field is only defined to be 2 bits wide, the hardware "accepts" all L values from 0 to 3. On ISA 3.1 hardware (Power10), the hardware "accepts" all L values from 0 to 7.

The L field check was removed from Valgrind so that its behavior matches the hardware. Valgrind cannot really track the cache or memory state of the underlying Power hardware, so it simply recognizes the instruction and does nothing.

A patch was committed to Valgrind to remove the L field check. Now Valgrind allows all of the L values just like the real hardware.

The patch should be in the next Valgrind release 3.24.0 that is currently scheduled for 10/31/2024.
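
The before/after behavior described in this comment can be pictured with a small toy model. This is not Valgrind source code: the function names and the `host_is_isa_3_1` flag are invented for illustration, and the "before" rule encodes only what is stated above (L=6 rejected on hosts that are not ISA 3.1).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the check as described above, before the patch:
 * dcbf with L=6 is treated as illegal unless the host is ISA 3.1.
 */
bool
dcbf_accepted_before(bool host_is_isa_3_1, unsigned l)
{
	assert(l < 8);

	if (l == 6 && !host_is_isa_3_1)
		return false;	/* reported as an illegal opcode */
	return true;
}

/*
 * Toy model after the patch: the L field check is gone, so every L value
 * is recognized and the instruction is treated as a no-op.
 */
bool
dcbf_accepted_after(bool host_is_isa_3_1, unsigned l)
{
	assert(l < 8);
	(void)host_is_isa_3_1;	/* host ISA level no longer matters */

	return true;
}
```

Under this model, `dcbf_accepted_before(false, 6)` is false, which corresponds to the "Illegal opcode" reports seen on the Power9 builders, while `dcbf_accepted_after(false, 6)` is true.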

Frank Heimes (fheimes) wrote :

Thanks Peter and Carl for the update and deep investigation. Once the Valgrind patch is accepted upstream, we'll see whether we go with the new Valgrind version for the Ubuntu 'Plucky' release (currently the development release) and onwards and leave the build time tests on ppc64el skipped for older releases, or whether we cherry-pick the patch for the older releases ('SRU'). It'll depend a bit on the complexity ...

Ural Tunaboyu (uralt)
Changed in pmdk (Ubuntu):
milestone: ubuntu-24.10-beta → ubuntu-25.04-beta
Changed in pmdk (Debian):
status: New → Confirmed