monero FTBFS on riscv64

Bug #1940505 reported by Lukas Märdian
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
binutils
Fix Released
Medium
binutils (Ubuntu)
Invalid
Undecided
Unassigned
monero (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

monero FTBFS on riscv64 due to toolchain issues:

[ 64%] Linking CXX executable signature_fuzz_tests
cd "/<<PKGBUILDDIR>>/obj-riscv64-linux-gnu/tests/fuzz" && /usr/bin/cmake -E cmake_link_script CMakeFiles/signature_fuzz_tests.dir/link.txt --verbose=1
/usr/bin/c++ -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -pthread -DNO_AES -fno-strict-aliasing -std=c++11 -D_GNU_SOURCE -Wall -Wextra -Wpointer-arith -Wundef -Wvla -Wwrite-strings -Wno-error=extra -Wno-error=deprecated-declarations -Wno-unused-parameter -Wno-unused-variable -Wno-error=unused-variable -Wno-error=undef -Wno-error=uninitialized -Wlogical-op -Wno-error=maybe-uninitialized -Wno-error=cpp -Wno-reorder -Wno-missing-field-initializers -fPIC -Wformat -Wformat-security -fstack-protector -fstack-protector-strong -fstack-clash-protection -fno-strict-aliasing -ftemplate-depth=900 -Wl,-Bsymbolic-functions -Wl,-z,relro -pie -Wl,-z,relro -Wl,-z,now -Wl,-z,noexecstack -rdynamic CMakeFiles/signature_fuzz_tests.dir/signature.cpp.o CMakeFiles/signature_fuzz_tests.dir/fuzzer.cpp.o -o signature_fuzz_tests ../../lib/libwallet.a ../../src/cryptonote_core/libcryptonote_core.a ../../src/p2p/libp2p.a ../../contrib/epee/src/libepee.a ../../src/device/libdevice.a -Wl,-Bstatic -lrt -Wl,-Bdynamic -ldl ../../src/rpc/librpc_base.a ../../src/mnemonics/libmnemonics.a ../../src/device_trezor/libdevice_trezor.a ../../src/cryptonote_core/libcryptonote_core.a ../../src/multisig/libmultisig.a ../../src/blockchain_db/libblockchain_db.a ../../external/db_drivers/liblmdb/liblmdb.a ../../src/ringct/libringct.a ../../src/cryptonote_basic/libcryptonote_basic.a ../../src/device/libdevice.a -lhidapi-libusb ../../src/ringct/libringct_basic.a ../../src/blocks/libblocks.a ../../src/checkpoints/libcheckpoints.a ../../src/hardforks/libhardforks.a ../../src/net/libnet.a ../../src/common/libcommon.a ../../src/crypto/libcncrypto.a ../../contrib/epee/src/libepee.a ../../external/easylogging++/libeasylogging.a -lrandomx -lunbound -lboost_regex -lboost_date_time -lssl -lcrypto -lzmq -lpgm -lnorm -lgssapi_krb5 -lsodium ../../src/libversion.a -lminiupnpc -lboost_chrono -lboost_serialization -lboost_filesystem -lboost_system -lboost_thread -lboost_program_options -Wl,-Bstatic -lrt -Wl,-Bdynamic -ldl -latomic
../../external/db_drivers/liblmdb/liblmdb.a(mdb.c.o): in function `mdb_node_add':
./obj-riscv64-linux-gnu/external/db_drivers/liblmdb/./external/db_drivers/liblmdb/mdb.c:8245:(.text+0x44e6): relocation truncated to fit: R_RISCV_JAL against `mdb_assert_fail.constprop.0'
collect2: error: ld returned 1 exit status
make[3]: *** [tests/fuzz/CMakeFiles/signature_fuzz_tests.dir/build.make:166: tests/fuzz/signature_fuzz_tests] Error 1
make[3]: Leaving directory '/<<PKGBUILDDIR>>/obj-riscv64-linux-gnu'
make[2]: *** [CMakeFiles/Makefile2:4165: tests/fuzz/CMakeFiles/signature_fuzz_tests.dir/all] Error 2
make[2]: Leaving directory '/<<PKGBUILDDIR>>/obj-riscv64-linux-gnu'
make[1]: *** [Makefile:163: all] Error 2
make[1]: Leaving directory '/<<PKGBUILDDIR>>/obj-riscv64-linux-gnu'
dh_auto_build: error: cd obj-riscv64-linux-gnu && make -j1 "INSTALL=install --strip-program=true" VERBOSE=1 returned exit code 2
make: *** [debian/rules:47: binary-arch] Error 25
dpkg-buildpackage: error: debian/rules binary-arch subprocess returned exit status 2

Revision history for this message
In , Yitingwang16 (yitingwang16) wrote :
Download full text (3.2 KiB)

If there are two callers in two .text sections (for example sections A, B), followed by two sections with R_RISCV_ALIGN type relocations (section C, D), then followed by the called functions in sections behind (section E), there is a chance to reproduce the issue.

I draw a picture to illustrate the problem:

                   ____________________________________
       high | |
        ^ -> | section E: cc, dd | <-
  | | |___________________________________| |
        | | | | |
        | | | section D: align2 (R_RISCV_ALIGN) | |
        | | |___________________________________| |
        | | | | |
        | | | section C: align1 (R_RISCV_ALIGN) | |
        | | |___________________________________| |
        | | | | |
        | | | section B : ff (R_RISCV_CALL) |--
        | | |___________________________________|
        | | | |
  address low -- | section A : _start (R_RISCV_CALL) |
                   |___________________________________|

Consider this situation: Section A, B, C, D, E are linked in sequence. Section C and D have R_RISCV_ALIGN type relocations. Function _start() calls function cc(), function ff() calls dd(). The size of each section is 8, 8, 64, 0xfffbc, 10 bytes before relaxation.

In the first relaxation pass (info->relax_pass ==0), the R_RISCV_CALL relocations in section A and B couldn't be relaxed to R_RISCV_JAL, because the call from _start() to cc() and the call from ff() to dd() don't fit with 21-bit offset. The offset is 0x10001c indeed. In the first relaxation pass, nothing is done.

In the third relaxation pass (info->relax_pass ==2), the R_RISCV_ALIGN relocations in section C and D could be relaxed. The sizes of section C and D were changed to 34 and 0xfffba bytes. Now, the offset of the call from _start() to cc() is 0xffffe, the offset from ff() to dd() equals to this value too.

In the second round of relaxation, in the first pass (info->relax_pass ==0), the R_RISCV_CALL relocation in section A could be relaxed to R_RISCV_JAL. After the relaxation, section A's size is reduced by 4 bytes. So, section B's base address and every symbol go down 4 bytes forward. However section C, D and E will not go down 4 bytes, because of the .balign restriction placed in section C and D.
But, when linker processes the relaxation of section B, it uses the original section B symbol addresses to calculate the offset of R_RISCV_CALL relocation (from ff() to dd()). The offset is within 1M bytes offset (0xffffe). So, the R_RISCV_CALL relocation is relaxed to R_RISCV_JAL. This is a mistake!
When linker finally performs the relocation (in perform_relocation()), the section B's symbol has been adjusted, then the bug is exposed.

I created a simple test case to trigger this issue:

    riscv64-unknown-linux-gnu-gcc -nostdlib -o a.out a.s align_1.s align_2.s c.s
    /tmp/ccgFgKhw.o: In function `ff':
    (.text+0x0): relocation truncated to fit: R_RISCV_...

Read more...

Revision history for this message
In , Yitingwang16 (yitingwang16) wrote :

Created attachment 12070
test case

Revision history for this message
In , Yitingwang16 (yitingwang16) wrote :

I have a fix locally which changes the common ld/ldlang.c. I can post to binutils mailing list for review.

I am not sure whether there is a possibility of a similar problem occurring *between* passes instead of within a single pass. I.e. what if a call in B is within range and so gets relaxed in one pass, then in a later pass B moves down due to relaxation in a previous section - but the destination stays in the same place due to alignment? Is that possible?

Revision history for this message
In , Yitingwang16 (yitingwang16) wrote :

I've not figured out how to send a proper patch to binutils mailing list.

github is much easier. See https://github.com/riscv/riscv-binutils-gdb/pull/185

Revision history for this message
In , Wilson-5 (wilson-5) wrote :

The way that this should work is that if the call crosses section boundaries, then we need to use the max alignment of any section between the call instruction and the target, including the call and target sections. Except we don't have code to compute that yet, so we just use the maximum alignment of all sections in the output. If the call does not cross section boundaries, then it should only use the alignment of the section it is in. This alignment value is added in to the offset to make sure that we don't relax something that might not fit after alignment is handled. This has the downside that sometimes we refuse to relax something that might be relaxable because of a section with large alignment constraint. But it is safe, in that we never accidentally relax something that should not be relaxed.

This is how _bfd_riscv_relax_lui and _bfd_riscv_relax_pc work. They use max_alignment unconditionally, and set max_alignment to the section alignment if the lui and gp are in the same section and it isn't the abs section.

This is unfortunately not how _bfd_riscv_relax_call works. It only adds in the alignment if the call and target are not in the same section. I'd call this a bug and that it needs to work the same as the other two functions.

I attached a proposed patch that fixes _bfd_riscv_relax_call and adds your example as a testcase.

Revision history for this message
In , Wilson-5 (wilson-5) wrote :

Created attachment 12071
untested patch to fix _bfd_riscv_relax_call

Revision history for this message
In , Wilson-5 (wilson-5) wrote :

You can create a patch with git diff and then attach it to the bug report. It needs to be mailed to the binutils list if it gets checked in, but you can always ask someone else to do that for you.

Contributing a patch to Binutils requires an FSF copyright assignment. I don't see an assignment in your name, and it isn't clear who you work for. Though from github I see you are in Beijing, so Alibaba and Huawei would be my first two guesses. I see GCC assignments for both of them, but not Binutils assignments. If you are serious about trying to contribute to GNU toolchain, then you should try to get a copyright assignment. Meanwhile, it might be simpler if someone who already has an assignment writes a patch.

Generally, I agree with Nelson Chu's github comment that we would prefer to avoid a ldlang.c change if we can. We might be able to get a better result by changing ldlang.c instead of the RISC-V port, but a change to ldlang.c requires a lot more testing because it affects all targets.

Revision history for this message
In , Yitingwang16 (yitingwang16) wrote :

Hi Jim,

I am very grateful to you for fixing the defect very quickly and giving me the instructions on how to contributing a patch to Binutils! My colleague has verified your patch works fine. :)

I am a WindRiver engineer, working on VxWorks. I mainly do system software and know very little toolchains. I just tried to contribute a patch. I see what you and Nelson Chu mean, we'd better not modify the common file, however, I didn't know how to fix elfnn-riscv.c. I sure am glad you resolved the defect, thank you again. :)

Revision history for this message
In , Cvs-commit (cvs-commit) wrote :

The master branch has been updated by Jim Wilson <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=c6261a00c3e70dd8e508062ea43a1bcb6d547621

commit c6261a00c3e70dd8e508062ea43a1bcb6d547621
Author: Jim Wilson <email address hidden>
Date: Tue Nov 12 15:50:48 2019 -0800

    RISC-V: Fix ld relax failure with calls and align directives.

    Make _bfd_riscv_relax_call handle section alignment padding same as
    the _bfd_riscv_relax_lui and _bfd_riscv_relax_pc functions already
    do. Use the max section alignment if section boundaries are crossed,
    otherwise the alignment of the containing section.

     bfd/
     PR 25181
     * elfnn-riscv.c (_bfd_riscv_relax_call): Always add max_alignment to
     foff. If sym_sec->output_section and sec->output_section are the same
     and not *ABS* then set max_alignment to that section's alignment.

     ld/
     PR 25181
     * testsuite/ld-riscv-elf/call-relax-0.s: New file.
     * testsuite/ld-riscv-elf/call-relax-1.s: New file.
     * testsuite/ld-riscv-elf/call-relax-2.s: New file.
     * testsuite/ld-riscv-elf/call-relax-3.s: New file.
     * testsuite/ld-riscv-elf/call-relax.d: New test.
     * testsuite/ld-riscv-elf/ld-riscv-elf.exp: Run call-relax test.

    Change-Id: Iaf65cee52345abf1955f36e8e72c4f6cc0db8d9a

Revision history for this message
In , Wilson-5 (wilson-5) wrote :

Fixed on mainline.

Lukas Märdian (slyon)
no longer affects: xmr-stak-cpu
summary: - FTBFS on riscv64
+ monero FTBFS on riscv64
Revision history for this message
Lukas Märdian (slyon) wrote :

Some context:
https://github.com/riscv/riscv-gnu-toolchain/issues/738
https://sourceware.org/bugzilla/show_bug.cgi?id=25181

According to the latter, this should already be fixed in upstream binutils since several years... why does it still happen?

Changed in binutils:
importance: Unknown → Medium
status: Unknown → Fix Released
tags: added: update-excuse
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package monero - 0.17.2.0+~0+20200826-1ubuntu2

---------------
monero (0.17.2.0+~0+20200826-1ubuntu2) impish; urgency=medium

  * Build with -O1 on riscv64 to avoid link error (LP: #1940505)
  * Limit parallelism only on certain architectures

 -- Graham Inggs <email address hidden> Mon, 20 Sep 2021 16:11:23 +0000

Changed in monero (Ubuntu):
status: New → Fix Released
Lukas Märdian (slyon)
Changed in binutils (Ubuntu):
status: New → Invalid
Lukas Märdian (slyon)
Changed in binutils (Ubuntu):
status: Invalid → Confirmed
status: Confirmed → New
Revision history for this message
Lukas Märdian (slyon) wrote :

monero builds fine nowadays on riscv64

Changed in binutils (Ubuntu):
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.