FTBFS (test fail with sigbus) on armhf in Hirsute

Bug #1919335 reported by Christian Ehrhardt 
Affects               Status         Importance   Assigned to   Milestone
Netgen                Fix Released   Unknown
netgen (Debian)       Fix Released   Unknown
netgen (Ubuntu)       Fix Released   Undecided    Unassigned
opencascade (Ubuntu)  Fix Released   Undecided    Unassigned

Bug Description

Hi,
I was checking a build failure in Ubuntu on armhf.
=> https://launchpad.net/ubuntu/+source/netgen/6.2.2006+really6.2.1905+dfsg-2/+build/20717107
The actual build worked fine, but it then crashes in the self-tests:

$ export PYTHONPATH="$PYTHONPATH:/root/netgen-6.2.2006+really6.2.1905+dfsg/debian/tmp/usr/lib/python3/dist-packages"
$ apt install python3-tk python3-numpy
$ cd ~/netgen-6.2.2006+really6.2.1905+dfsg/tests/pytest
$ LD_LIBRARY_PATH=/root/netgen-6.2.2006+really6.2.1905+dfsg/debian/tmp/usr/lib/$DEB_HOST_MULTIARCH python3 -m pytest -k test_pickling -s
...
test_pickling.py Bus error (core dumped)
This seems to be 100% reproducible if one follows the steps that the Debian package build does.

The other tests pass:

test_pickling.py::test_pickle_stl PASSED
test_pickling.py::test_pickle_occ PASSED
test_pickling.py::test_pickle_geom2d PASSED
test_pickling.py::test_pickle_mesh PASSED
Just test_pickle_csg fails.
And in this test the failing line is: geo_dump = pickle.dumps(geo)
With geo being <netgen.libngpy._csg.CSGeometry object at 0xf6da99b0>

Running that under python3-dbg and loading the core file in gdb shows the crash
deep inside netgen's pickling code (which I guess is better than a generic pickling issue):

#0 0xf659c99e in ngcore::BinaryOutArchive::Write<double> (x=10000000000, this=0xffa90cc4) at ./libsrc/stlgeom/../general/../core/archive.hpp:732
#1 ngcore::BinaryOutArchive::operator& (this=0xffa90cc4, d=@0x26aa6d8: 10000000000) at ./libsrc/stlgeom/../general/../core/archive.hpp:681
#2 0xf641d4de in netgen::Surface::DoArchive (archive=..., this=0x26aa6d0) at ./libsrc/csg/surface.hpp:68
#3 netgen::OneSurfacePrimitive::DoArchive (archive=..., this=0x26aa6d0) at ./libsrc/csg/surface.hpp:344
#4 netgen::QuadraticSurface::DoArchive (this=0x26aa6d0, ar=...) at ./libsrc/csg/algprim.hpp:52
#5 0xf641dc00 in netgen::Sphere::DoArchive (this=0x26aa6d0, ar=...) at ./libsrc/csg/algprim.hpp:151
#6 0xf6434c28 in ngcore::Archive::operator&<netgen::Surface, void> (val=..., this=0xffa90cc4) at ./libsrc/csg/../general/../core/archive.hpp:307
#7 ngcore::Archive::operator&<netgen::Surface> (this=this@entry=0xffa90cc4, p=@0x2727718: 0x26aa6d0) at ./libsrc/csg/../general/../core/archive.hpp:490
#8 0xf6430dca in ngcore::Archive::Do<netgen::Surface*, void> (n=<optimized out>, data=<optimized out>, this=0xffa90cc4) at ./libsrc/csg/../general/../core/archive.hpp:280
#9 ngcore::Archive::operator&<netgen::Surface*> (v=std::vector of length 32, capacity 32 = {...}, this=0xffa90cc4) at ./libsrc/csg/../general/../core/archive.hpp:209
#10 ngcore::SymbolTable<netgen::Surface*>::DoArchive<netgen::Surface*> (ar=..., this=0x2843c64) at ./libsrc/csg/../general/../core/symboltable.hpp:44
#11 ngcore::Archive::operator&<ngcore::SymbolTable<netgen::Surface*>, void> (val=..., this=0xffa90cc4) at ./libsrc/csg/../general/../core/archive.hpp:307
#12 netgen::CSGeometry::DoArchive (this=0x2843c60, archive=...) at ./libsrc/csg/csgeom.cpp:329
#13 0xf648a958 in ngcore::Archive::operator&<netgen::CSGeometry, void> (val=..., this=0xffa90cc4) at ./libsrc/csg/../general/../core/archive.hpp:305
#14 ngcore::Archive::operator&<netgen::CSGeometry> (this=this@entry=0xffa90cc4, p=@0xffa90ba4: 0x2843c60) at ./libsrc/csg/../general/../core/archive.hpp:518
#15 0xf64a4218 in ngcore::NGSPickle<netgen::CSGeometry, ngcore::BinaryOutArchive, ngcore::BinaryInArchive>()::{lambda(netgen::CSGeometry*)#1}::operator()(netgen::CSGeometry*) const (
    self=<optimized out>, this=<optimized out>) at /usr/include/pybind11/pytypes.h:199
....
That is:
./libsrc/stlgeom/../general/../core/archive.hpp:732

 721 private:
 722   template <typename T>
 723   Archive & Write (T x)
 724   {
 725     if (unlikely(ptr > BUFFERSIZE-sizeof(T)))
 726     {
 727       stream->write(&buffer[0], ptr);
 728       *reinterpret_cast<T*>(&buffer[0]) = x; // NOLINT
 729       ptr = sizeof(T);
 730       return *this;
 731     }
 732     *reinterpret_cast<T*>(&buffer[ptr]) = x; // NOLINT
 733     ptr += sizeof(T);
 734     return *this;
 735   }
 736 };
With the variables in the crash file being:
(gdb) p &buffer
$5 = (std::array<char, 1024> *) 0xffa90d40
(gdb) p ptr
$3 = 1

This might explain the SIGBUS, which typically indicates an unaligned access: the store on line 732 targets &buffer[ptr]. If the real code (not gdb on the crash file) evaluates that pointer addition to just "0xffa90d41", as gdb does, then the double is written to an odd address and the store fails.
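
To illustrate the suspicion, here is a minimal sketch and not netgen's actual code or its upstream fix. SketchBuffer, ReadAt and is_aligned are hypothetical names made up for this example, and the flush-to-stream branch of the real Write is left out. It checks the address arithmetic from the gdb session and shows a write that copies bytes with std::memcpy, which makes no alignment assumption about the destination, instead of storing through a reinterpret_cast pointer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the archive's internal buffer; names and
// layout are illustrative only, not netgen's real class.
struct SketchBuffer
{
  static constexpr std::size_t BUFFERSIZE = 1024;
  std::array<char, BUFFERSIZE> buffer{};
  std::size_t ptr = 0;

  // Copy the bytes of x into the buffer at the current offset.
  // std::memcpy works for any destination address, so an odd
  // offset such as ptr == 1 is fine here.
  template <typename T>
  void Write (T x)
  {
    assert(ptr + sizeof(T) <= BUFFERSIZE); // the sketch never flushes
    std::memcpy(&buffer[ptr], &x, sizeof(T));
    ptr += sizeof(T);
  }

  // Read a T back from an arbitrary (possibly unaligned) offset.
  template <typename T>
  T ReadAt (std::size_t offset) const
  {
    T x;
    std::memcpy(&x, &buffer[offset], sizeof(T));
    return x;
  }
};

// True if addr is a multiple of alignment.
inline bool is_aligned (std::uintptr_t addr, std::size_t alignment)
{
  return addr % alignment == 0;
}
```

With the values from the crash file, is_aligned(0xffa90d40, 8) holds for the buffer start, while 0xffa90d40 + 1 is odd and therefore not aligned for any multi-byte type. Writing one char followed by a double through SketchBuffer::Write puts the double at offset 1, the same situation as in the backtrace, without relying on unaligned stores being supported by the hardware.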

I'm a bit lost, as .hpp backends to serialize/pickle python objects really aren't my home turf :-/
Therefore I wanted to reach out to you, as the experts on netgen, to ask whether this makes sense to you.
I can keep the repro systems around for a while, so if you have debug questions or small modifications to try, I should be able to test them.

P.S. This didn't show up in the past because the tests were not run correctly at build time. The last Debian upload fixed that, and since then it is an FTBFS. It does not seem to trigger in all environments, though; e.g. in the Debian builds it did not crash the same way.

FYI: there is also this recent bug about unaligned access. I'm not entirely sure it is related, as the logs linked there didn't look to be "the same": https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984439

Note: I've reported the very same bug upstream and will link it; this LP bug is meant as a tracker to be found via the update-excuse tag.

Christian Ehrhardt  (paelzer) wrote :

Since it was broken before in
  https://launchpad.net/ubuntu/+source/netgen/6.2.2006+really6.2.1905+dfsg-2
And only now shows up because of
  " * [5426125] Fix running tests"
being in
  https://launchpad.net/ubuntu/+source/netgen/6.2.2006+really6.2.1905+dfsg-2

I wonder if skipping this test on armhf would be the right way to mitigate it for the time being, so that things do not get stuck in proposed until it is really resolved. After all, it seems that it would not be a regression compared to before (on armhf).

Christian Ehrhardt  (paelzer) wrote :

Adding an "opencascade" task as that is blocked in hirsute-proposed due to this.

tags: added: update-excuse
Changed in netgen:
status: Unknown → New
Changed in netgen (Debian):
status: Unknown → New
Changed in netgen (Ubuntu):
status: New → Fix Released
Changed in opencascade (Ubuntu):
status: New → Fix Released
Changed in netgen:
status: New → Fix Released
Changed in netgen (Debian):
status: New → Fix Released