Spurious SEGV running inside kvm

Bug #1749247 reported by Howard Chu
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
openldap (Ubuntu)
Expired
Undecided
Unassigned
qemu (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

Running a continuous stream of operations against OpenLDAP slapd eventually causes a SEGV in liblber, in a segment of code that cannot fail:

 gdb /opt/symas/lib64/slapd CoreDump
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/symas/lib64/slapd...done.
[New LWP 5472]
[New LWP 5468]
[New LWP 5524]
[New LWP 5471]
[New LWP 5469]
[New LWP 5507]
[New LWP 5510]
[New LWP 5470]
[New LWP 5506]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/symas/lib64/slapd -u root -g root -h ldap:///'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f4e2c9f0160 in ber_dupbv_x (dst=0x196b268, src=0x7f25f8001070, ctx=0x0) at /home/build/git/sold-master/openldap/libraries/liblber/memory.c:513
513 new->bv_val[src->bv_len] = '\0';
[Current thread is 1 (Thread 0x7f260e242700 (LWP 5472))]
(gdb) l 500
495 if(( new = ber_memalloc_x( sizeof(struct berval), ctx )) == NULL ) {
496 return NULL;
497 }
498 }
499
500 if ( src->bv_val == NULL ) {
501 new->bv_val = NULL;
502 new->bv_len = 0;
503 return new;
504 }
(gdb)
505
506 if(( new->bv_val = ber_memalloc_x( src->bv_len + 1, ctx )) == NULL ) {
507 if ( !dst )
508 ber_memfree_x( new, ctx );
509 return NULL;
510 }
511
512 AC_MEMCPY( new->bv_val, src->bv_val, src->bv_len );
513 new->bv_val[src->bv_len] = '\0';
514 new->bv_len = src->bv_len;
(gdb) p *new
$1 = {bv_len = 0, bv_val = 0x0}
(gdb) p *src
$2 = {bv_len = 36, bv_val = 0x7f268ccc7bee <error: Cannot access memory at address 0x7f268ccc7bee>}
(gdb)

At line 506 we allocate some memory and check for a failure (returning NULL) and leave the function at line 509 if there was a failure. The allocation is for 37 bytes of memory and a memcpy into that memory succeeds on line 512. The SEGV occurs at line 513 and the pointer that was just returned from the allocator is NULL at this point. There are no other active threads that could be stomping on memory, there's no stack overrun or any other misbehavior that can account for it. Also, the identical test sequence completes without incident when running on the host OS instead of under kvm.
(The src->bv_val pointer points to valid data at the time of the crash; it's just residing in a mmap'd file and that mapping isn't preserved in the coredump. So ignore gdb's error there.)

Something in kvm is writing zeroes over a field of memory after we already checked that it was non-zero.

This is on
Linux anvil1 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Both the host and the guest VM are on identical OS revision.

Tags: bot-comment
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu:
status: New → Confirmed
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1749247/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Howard Chu (hyc)
affects: ubuntu → kvm (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

slapd is part of the openldap package - assigning this makes more sense I think.

affects: kvm (Ubuntu) → openldap (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Setting qemu(Ubuntu) which is the right package for the question of "the identical test sequence completes without incident when running on the host OS instead of under kvm".

Although that could just be timing and doesn't "have to be" a kvm memory clobbering.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

How reproducible is this - everytime or just once in a number of retry loops?
Could you share the minimal simplified setup+loop code to retrigger this over here?

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Eventually as you already found the question is how did bv_val get 0x0?

If the test can't be passed to me, but is reproducible, could you try to step live from line 506.
1. is new->bv_val really assigned some pointer (and which one)
2. when does that pointer get lost between 506 and 513

Also the memcopy seems to work (no crash on that line).
If you can modify and retest you could take a look to check if AC_MEMCPY in your case actually is memmove, bcopy or memcpy - maybe even iterate between those to be sure.

All of the copies return a pointer to dest which should still be what was assigned to new->bv_val, so is it still the same or did it return something else?

Revision history for this message
Howard Chu (hyc) wrote :

Will try to come up with a minimal reproducer. Currently it takes several hours to run the complete test, and a few hours before the SEGV occurs. But the stack trace is always identical when it happens. In multiple runs, it always succeeds on the host and always fails in the VM.

Revision history for this message
Quanah (quanah-zimbra) wrote : Re: [Bug 1749247] Re: Spurious SEGV running inside kvm

It can also die with the following backtrace:
(gdb) bt
#0 __memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:2331
#1 0x00007f570da4a14e in ber_dupbv_x (dst=0x25f55e8, src=0x7f2ee8001200,
ctx=0x0) at
/home/build/git/sold-2445/openldap/libraries/liblber/memory.c:512
#2 0x00007f570da4a19f in ber_dupbv (dst=0x25f55e8, src=0x7f2ee8001200) at
/home/build/git/sold-2445/openldap/libraries/liblber/memory.c:523
#3 0x00007f57091a07b4 in accesslog_op_mod (op=0x7f2f0518e780,
rs=0x7f2f0518e030) at
/home/build/git/sold-2445/openldap/servers/slapd/overlays/accesslog.c:2031
#4 0x00000000004d9be5 in overlay_op_walk (op=0x7f2f0518e780,
rs=0x7f2f0518e030, which=op_modify, oi=0x25eb720, on=0x25f53c0)
    at /home/build/git/sold-2445/openldap/servers/slapd/backover.c:661
#5 0x00000000004d9ee1 in over_op_func (op=0x7f2f0518e780,
rs=0x7f2f0518e030, which=op_modify) at
/home/build/git/sold-2445/openldap/servers/slapd/backover.c:730
#6 0x00000000004da044 in over_op_modify (op=0x7f2f0518e780,
rs=0x7f2f0518e030) at
/home/build/git/sold-2445/openldap/servers/slapd/backover.c:769
#7 0x00000000004cc061 in syncrepl_updateCookie (si=0x26499f0,
op=0x7f2f0518e780, syncCookie=0x7f2f0518e290) at
/home/build/git/sold-2445/openldap/servers/slapd/syncrepl.c:3982
#8 0x00000000004c0bd5 in do_syncrep2 (op=0x7f2f0518e780, si=0x26499f0) at
/home/build/git/sold-2445/openldap/servers/slapd/syncrepl.c:1019
#9 0x00000000004c2ea6 in do_syncrepl (ctx=0x7f2f0518ec30, arg=0x25ebb40)
at /home/build/git/sold-2445/openldap/servers/slapd/syncrepl.c:1567
#10 0x000000000043dbed in connection_read_thread (ctx=0x7f2f0518ec30,
argv=0x11) at
/home/build/git/sold-2445/openldap/servers/slapd/connection.c:1296
#11 0x00007f570dc6a909 in ldap_int_thread_pool_wrapper (xpool=0x24e8500) at
/home/build/git/sold-2445/openldap/libraries/libldap_r/tpool.c:963
#12 0x00007f570d8276ba in start_thread (arg=0x7f2f0518f700) at
pthread_create.c:333
#13 0x00007f570c86f41d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

On Wed, Feb 14, 2018 at 6:45 AM, Howard Chu <email address hidden> wrote:

> Will try to come up with a minimal reproducer. Currently it takes
> several hours to run the complete test, and a few hours before the SEGV
> occurs. But the stack trace is always identical when it happens. In
> multiple runs, it always succeeds on the host and always fails in the
> VM.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1749247
>
> Title:
> Spurious SEGV running inside kvm
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/openldap/+bug/
> 1749247/+subscriptions
>

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, this is just at the memcopy above.
I see in this trace that the AC_MEMCPY got mapped to __memmove_ssse3 of ../sysdeps/x86_64/multiarch/memcpy-ssse3.S

Lets assume (for the sake of trying something until you have a simplified reproducer) that sse3 might be broken in KVM on your system.
I'd expect that the compile time detection switched to memmove and the runtime feature detection found sse3 and used that.
We could do two things to try here:
1. we could modify your guest to not expose sse3 and retest in there (-cpu qemu64/qemu32 will be without sse3 and without a lot other special HW features - if you run with -cpu host you can do "-enable-kvm -cpu host,-sse3" to just kill this feature). How do you start your KVM guest (commandline, libvirt, other - so we know how to help you modify?
2. we could try to influence the build to not use memmove but memcpy or bcopy instead and check if that makes it work.

Revision history for this message
Leo Yuriev (erthink) wrote :

I suggest that the problem does not reproduce on https://github.com/leo-yuriev/ReOpenLDAP.

Please provide a testcase, I will check.

Under kvm gettimeofday() is costly, so I suggest this just expose one of race condition in the slapd.

LDAP_JITTER maybe usefull to reproduce bug without kvm.
https://github.com/leo-yuriev/ReOpenLDAP/blob/devel/include/reldap.h#L245

Revision history for this message
Howard Chu (hyc) wrote :

I think we can close this out, we were able to reproduce the issue without kvm. Sorry for the false alarm.

Changed in qemu (Ubuntu):
status: Incomplete → Invalid
Changed in openldap (Ubuntu):
status: Confirmed → Invalid
status: Invalid → In Progress
Revision history for this message
David Britton (dpb) wrote :

Hi Howard, I assigned this to you since you marked it as 'in progress' (as is convention). If that is not correct, you can unassign yourself and set it back to 'triaged' or 'confirmed' to allow someone else to pick it up.

Thanks!

Changed in openldap (Ubuntu):
assignee: nobody → Howard Chu (hyc)
Changed in openldap (Ubuntu):
status: In Progress → Incomplete
assignee: Howard Chu (hyc) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for openldap (Ubuntu) because there has been no activity for 60 days.]

Changed in openldap (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.