bacula-fd segfault on status client from Bat

Bug #1800040 reported by Richard Neighbour
This bug affects 1 person
Affects: bacula (Ubuntu)
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Architecture: armhf - bacula-fd, bacula-dir, bacula-sd, bconsole (hostname: selene)
              amd64 - bacula-console-qt (bat)
Ubuntu versions: Bionic and Cosmic (the Bionic bat was on Mint 19, the Cosmic bat is on Ubuntu 18.10)
Bacula versions: 9.0.6 and 9.0.8
Repeatable: Yes, every time

In Bat (bacula-console-qt), select Clients, right-click the client for selene and choose Status Client.

Within seconds bacula-fd crashes, and bat shows a "could not connect" error message on the console page.
This only happens to the bacula-fd on the armhf host, which also runs the bacula server. All the other clients (a mix of Windows & Linux, 32 and 64bit, from version 5.2.10 onwards) work fine.

I've just found that running status client for selene-fd in bconsole on cosmic/bacula 9.0.8 works properly. I don't know whether it worked on bionic/bacula 9.0.6, and it's too late to test now without reverting the upgrade.
It did not fail on xenial/bacula 7.0.5.
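
For anyone who wants to script the bconsole workaround, here is a minimal sketch. It assumes bconsole is installed, is on the PATH, and is already configured to reach the Director. The helper names are made up for illustration; selene-fd is the client name from this report.

```python
import subprocess


def status_client_command(client_name: str) -> str:
    """Build the bconsole input that asks the Director for a client's status."""
    return f"status client={client_name}\nquit\n"


def run_status_client(client_name: str, timeout: int = 60) -> str:
    """Pipe the command into bconsole and return whatever it prints.

    Assumption: bconsole's default configuration points at the Director.
    """
    result = subprocess.run(
        ["bconsole"],
        input=status_client_command(client_name),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

Calling run_status_client("selene-fd") on the Director host should return the same status text that the interactive bconsole session prints.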

journalctl -xe shows:
Oct 25 22:22:10 selene bacula-fd[1634]: Bacula interrupted by signal 11: Segmentation violation
Oct 25 22:22:10 selene bacula-fd[1634]: Kaboom! bacula-fd, selene-fd got signal 11 - Segmentation violation at 25-Oct-2018 22:22:09. Attempting traceback.
Oct 25 22:22:10 selene bacula-fd[1634]: Kaboom! exepath=/usr/sbin/
Oct 25 22:22:09 selene bacula-fd[1634]: Bacula interrupted by signal 11: Segmentation violation
Oct 25 22:22:10 selene bacula-fd[1634]: Calling: /usr/sbin/btraceback /usr/sbin/bacula-fd 1634 /var/lib/bacula
Oct 25 22:22:11 selene bacula-fd[1634]: It looks like the traceback worked...
Oct 25 22:22:11 selene bacula-fd[1634]: LockDump: /var/lib/bacula/bacula.1634.traceback
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: lockmgr.c:1179-0 lockmgr disabled
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 536 bytes at 13458d0 from jcr.c:386
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 1351b90 from jcr.c:390
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 4120 bytes at 1351f20 from bnet.c:611
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 4120 bytes at 1352f58 from bnet.c:610
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 536 bytes at 75300ae0 from bnet.c:612
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 75300d18 from jcr.c:384
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 75300e50 from jcr.c:388
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 753012a8 from job.c:283
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 75301420 from find.c:57
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 536 bytes at 75301598 from output.c:103
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 304 bytes at 1342b00 from bnet.c:601
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 7 bytes at 1342298 from bnet.c:613
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 14 bytes at 1342c50 from bnet.c:614
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 8 bytes at 13517e0 from workq.c:198
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 16 bytes at 74900ab0 from jcr.c:372
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 768 bytes at 74900ae0 from find.c:54
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 28 bytes at 74900e00 from job.c:282
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 34 bytes at 74900e40 from job.c:285
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 6 bytes at 74900f10 from job.c:2082
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 56 bytes at 74900f38 from status.c:469
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 74900f90 from output.c:205
Oct 25 22:22:11 selene bacula-fd[1634]: selene-fd: smartall.c:400-2863311530 Orphaned buffer: selene-fd 280 bytes at 749010c8 from output.c:206
Oct 25 22:22:11 selene systemd[1]: bacula-fd.service: Main process exited, code=exited, status=11/n/a
Oct 25 22:22:11 selene systemd[1]: bacula-fd.service: Failed with result 'exit-code'.
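
To see at a glance where the orphaned buffers listed in the journal were allocated, the excerpt can be summarized per source file. This is only a rough sketch: the regular expression is written against the line format shown in this report, and the function name is illustrative.

```python
import re
from collections import Counter

# Matches the smartall.c "Orphaned buffer" lines as they appear in this journal.
ORPHAN_RE = re.compile(
    r"Orphaned buffer:\s+\S+\s+(?P<size>\d+) bytes at (?P<addr>[0-9a-f]+) "
    r"from (?P<source>[\w.]+):(?P<line>\d+)"
)


def summarize_orphans(journal_text: str) -> Counter:
    """Return the total number of orphaned bytes per source file."""
    totals = Counter()
    for match in ORPHAN_RE.finditer(journal_text):
        totals[match["source"]] += int(match["size"])
    return totals
```

Feeding it the saved output of journalctl gives a Counter keyed by file name (jcr.c, bnet.c, and so on), which makes it easier to compare crashes across runs.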

Christian Ehrhardt  (paelzer) wrote :

Hi,
I tried to do the same from "bat" but it just happened to work for me. See attached screenshot (I waited a bit, no fail later).

I had bacula just configured by installing it and pressing enter at every question (config for local pgsql DB essentially). I have not changed any other config and I ran on x86 (in a bionic container called "b").

I now wonder if this is specific either to armhf or to your bacula config.
Could you:
- try to reproduce on armhf with the fewest steps to configure (like my install: enter, enter, ..., start bat, check status) and see if it fails as well?
- try to reproduce the same on x86?
- clarify whether you use remote clients in any way (so that an x86 client vs. an armhf host might be a reason)?

Changed in bacula (Ubuntu):
status: New → Incomplete

Christian Ehrhardt  (paelzer) wrote :

Actually I guess "mix of Windows & Linux 32 and 64bit" implies that x86 client was tested.

So let me try to clarify:
- server armhf, bacula-fd armhf, bat on remote x86 - works
- server armhf, bacula-fd armhf, bat on same armhf - fails

Is that correct?
If so, does it also fail right out of the box with minimal config?

Also, is there any chance you could try to break it on x86 only, so that more people are able to reproduce it?
- server x86, bacula-fd x86, bat on same x86 - ??

Richard Neighbour (richard-neighbour) wrote :

Hi Christian

Yes, x86 clients tested on Win10, Win7, Win7x64, Win-Ancient_Server (5.2.10, 7.0.5, 7.4.4, 9.0.6), Mint x64, Mint i386 (7.0.5), CentOS 7.5 (5.2.13), Ubuntu 18.04 (9.0.6). No clients have been deleted recently, although a few won't see in the new year. The Windows clients will be getting updated to at least 7.4.4 over the next few months, the linux ones will stay with their distro defaults.

Results so far, all Bacula 9.0.8 on Cosmic

Test  Director  FD     Bat    Client status result
1     armhf     armhf  x86    fails - bacula-fd terminated
2     armhf     x86    x86    works (various bacula-fd versions)
3     x86       x86    x86    works
4     x86       armhf  x86    fails - bacula-fd terminated

5     armhf     armhf  armhf  ToDo

Tests 3 & 4 used a newly installed Bacula server on Cosmic with minimal config. The armhf bacula-fd in test 4 was the same one used for test 1, updated to permit the x86 director.

I guess it'll take me a few days to see if bat can work on armhf since there's no desktop (or keyboard/monitor right now). And I suspect it'll run a bit slow and warm too. :-)

-R

Robie Basak (racb)
Changed in bacula (Ubuntu):
status: Incomplete → New

Richard Neighbour (richard-neighbour) wrote :

Test 5 - Bat on armhf fails too.

Test  Director  FD     Bat    Client status result
1     armhf     armhf  x86    fails - bacula-fd terminated
2     armhf     x86    x86    works (various bacula-fd versions)
3     x86       x86    x86    works
4     x86       armhf  x86    fails - bacula-fd terminated
5     armhf     armhf  armhf  fails - bacula-fd terminated

Host disks were cloned and an lubuntu desktop was added so that Bat could run. I was surprised at just how well it worked overall (apart from, of course, not being able to get the status of the bacula-fd on the same host).

Robie Basak (racb) wrote :

Thank you for your detailed investigations.

As this bug appears to manifest only when armhf is involved, I'm marking it as Importance: Low due to "unusual end-user configurations or uncommon hardware" from https://wiki.ubuntu.com/Bugs/Importance

Unfortunately I don't expect anyone to spend time on it soon due to other bugs having higher priorities. But if you manage to figure it out and give us a patch, we can help you get a fix landed.

Robie Basak (racb) wrote :

(if I'm wrong about the armhf thing, please let us know)

Karl Stenerud (kstenerud) wrote :

Thanks Richard!

From the testing so far, we have:

[Director FD Bat]
[A A A] FAIL
[A A X] FAIL
[A X A] (not tested)
[A X X] works
[X A A] (not tested)
[X A X] FAIL
[X X A] (not tested)
[X X X] works

From the look of things, the critical element appears to be bacula-fd running on armhf.
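
The same conclusion can be checked mechanically. The sketch below transcribes the tested rows of the matrix above and looks for a component whose architecture alone predicts the outcome. The function and variable names are illustrative only.

```python
# Results transcribed from the matrix above: (director, fd, bat) -> failed?
# "A" = armhf, "X" = x86. Untested combinations are left out.
RESULTS = {
    ("A", "A", "A"): True,
    ("A", "A", "X"): True,
    ("A", "X", "X"): False,
    ("X", "A", "X"): True,
    ("X", "X", "X"): False,
}
COMPONENTS = ("director", "fd", "bat")


def predictive_components(results):
    """Return the components whose architecture alone predicts every outcome."""
    found = []
    for index, name in enumerate(COMPONENTS):
        outcomes_by_arch = {}
        for combo, failed in results.items():
            outcomes_by_arch.setdefault(combo[index], set()).add(failed)
        # Predictive if each architecture maps to exactly one outcome
        # and the outcomes differ between architectures.
        consistent = all(len(v) == 1 for v in outcomes_by_arch.values())
        distinct = len({next(iter(v)) for v in outcomes_by_arch.values()}) > 1
        if consistent and distinct:
            found.append(name)
    return found
```

With the five tested rows, only the FD column separates failures from successes; the Director and Bat columns each contain both outcomes for at least one architecture.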

Richard Neighbour (richard-neighbour) wrote :

Yes, it looks to me like the armhf build of bacula-fd is the issue. 7.0.5 worked fine and bconsole provides a workaround, so it's not a showstopper. FWIW, A-X-A on Karl's list works too.

I'm happy to test and poke around at OS level, but as for providing patches, no chance. I haven't touched any C variants since 1992 and have no intention of going back to real programming any time soon. Under protest (and suitable disclaimers) I have been known to mess with bash, SQL and python in recent years if a PM is truly desperate.

For anyone else reading, a Raspberry Pi and a couple of portable USB drives make for a nice little small/home office backup solution, for a lot less than the price of an LTO drive too.

And I've just found that Bat runs nicely over SSH + X11 tunnel so this has been a useful exercise for me.

Cheers,
=Richard

Changed in bacula (Ubuntu):
importance: Undecided → Low
status: New → Triaged

Radoslaw Korzeniewski (radoslaw) wrote :

Hello, could someone please attach a Bacula *.traceback file generated by this SIGSEGV? Thanks.

Richard Neighbour (richard-neighbour) wrote :

Sorry, just rediscovered this. Attached a whole bunch of tracebacks for you.

=R

Radoslaw Korzeniewski (radoslaw) wrote :

Thanks. Unfortunately, the traceback is not working properly in your environment:

/usr/sbin/btraceback: 60: /usr/sbin/btraceback: gdb: not found
(...)

Could you install gdb and resend newly generated traceback files?

Thanks.
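
As the error above shows, btraceback shells out to gdb, so it is worth confirming the debugger is on the PATH before reproducing the crash. A minimal sketch follows; the function name is made up, and on Ubuntu the package to install is gdb.

```python
import shutil


def missing_traceback_tools(tools=("gdb",)):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from missing_traceback_tools() means gdb was found. Otherwise install it, trigger the crash again, and collect the fresh bacula.*.traceback files from /var/lib/bacula.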

Richard Neighbour (richard-neighbour) wrote :

GDB installed and it's now showing some more info.
Config:
  Selene (Bacula host) has just been upgraded to Disco, Bacula now version 9.4.2.
  Bat running from Selene via ssh tunnel (ssh -X userxxx@selene)
  Client -> Status Client causes FD to fail, Bat still reports new version of client and sometimes gets some data in header tab.

root@selene:/var/lib/bacula# cat bacula.3064.traceback
[New LWP 3065]
[New LWP 3069]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
/etc/bacula/scripts/btraceback.gdb:1: Error in sourced command file:
'fail_time' has unknown type; cast it to its declared type
[Inferior 1 (process 3064) detached]
Attempt to dump current JCRs. njcrs=1
threadid=0x75cb7450 JobId=0 JobStatus=C jcr=0x75300488 name=-Console-.2019-04-25_22.01.33_02
 use_count=2 killable=1
 JobType=I JobLevel=
 sched_time=25-Apr-2019 22:08 start_time=25-Apr-2019 22:08
 end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
 db=(nil) db_batch=(nil) batch_started=0
List plugins. Hook count=0

root@selene:/var/lib/bacula# cat bacula.3040.traceback
[New LWP 3041]
[New LWP 3048]
[New LWP 3049]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
/etc/bacula/scripts/btraceback.gdb:1: Error in sourced command file:
'fail_time' has unknown type; cast it to its declared type
[Inferior 1 (process 3040) detached]
Attempt to dump current JCRs. njcrs=1
threadid=0x752ff450 JobId=0 JobStatus=C jcr=0x749005d0 name=-Console-.2019-04-25_22.01.33_02
 use_count=2 killable=1
 JobType=I JobLevel=
 sched_time=25-Apr-2019 22:07 start_time=25-Apr-2019 22:07
 end_time=01-Jan-1970 01:00 wait_time=01-Jan-1970 01:00
 db=(nil) db_batch=(nil) batch_started=0
List plugins. Hook count=0

Hope this helps, I should have some more time over the next few months to poke around at things too.
-R
