Exclusive-Lock Issue

Bug #1607694 reported by Luis
This bug affects 3 people
Affects: ceph (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

After a few days of running a VM over RBD under heavy load, the VM freezes and we get this log:

2016-07-26 16:20:47.069+0000: starting up libvirt version: 1.3.1, package: 1ubuntu10~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Fri, 15 Apr 2016 08:05:47 +0000), qemu version: 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.24), hostname: tckvmhv01.todoencloud
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name opennebula -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 7733f4c6-c72d-af36-2e84-cbc44c5c8da0 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-opennebula/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 'file=rbd:sata/opennebula.img:id=sata:key=AQB69U1Xq+bxHhAAcw7ZWZlOjE0Bf88QMsTqMQ==:auth_supported=cephx\;none:mon_host=10.255.128.238\:6789,format=raw,if=none,id=drive-virtio-disk0' -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/usr/src/systemrescuecd.iso,format=raw,if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1c:07:7f,bus=pci.0,addr=0x3 -netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:42:b6:74,bus=pci.0,addr=0x6 -netdev tap,fd=37,id=hostnet2,vhost=on,vhostfd=38 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:2b:bb:89,bus=pci.0,addr=0x7 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)
librbd/ExclusiveLock.cc: In function 'std::string librbd::ExclusiveLock<ImageCtxT>::encode_lock_cookie() const [with ImageCtxT = librbd::ImageCtx; std::string = std::basic_string<char>]' thread 7fded06a2a00 time 2016-07-28 23:44:09.845595
librbd/ExclusiveLock.cc: 197: FAILED assert(m_watch_handle != 0)
 ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (()+0x24c4bb) [0x7fdebcbde4bb]
 2: (()+0x7cf54) [0x7fdebca0ef54]
 3: (()+0x7dadb) [0x7fdebca0fadb]
 4: (()+0x7e827) [0x7fdebca10827]
 5: (()+0x65d46) [0x7fdebc9f7d46]
 6: (()+0x66218) [0x7fdebc9f8218]
 7: (rbd_aio_write()+0x29) [0x7fdebc9ec629]
 8: (()+0xde816) [0x7fded0108816]
 9: (()+0xde910) [0x7fded0108910]
 10: (()+0xb69b0) [0x7fded00e09b0]
 11: (()+0xbbc44) [0x7fded00e5c44]
 12: (()+0xbbc44) [0x7fded00e5c44]
 13: (()+0xbc8fc) [0x7fded00e68fc]
 14: (()+0xfc36a) [0x7fded012636a]
 15: (()+0x49800) [0x7fdeb84ac800]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-28 21:44:10.296+0000: shutting down

root@tckvmhv01:/var/log/libvirt/qemu# dpkg -l | grep ceph
ii ceph-common 10.2.0-0ubuntu0.16.04.1~cloud0 amd64 common utilities to mount and interact with a ceph storage cluster
ii libcephfs1 10.2.0-0ubuntu0.16.04.1~cloud0 amd64 Ceph distributed file system client library
ii python-ceph 10.2.0-0ubuntu0.16.04.1~cloud0 all Meta-package for python libraries for the Ceph libraries
ii python-cephfs 10.2.0-0ubuntu0.16.04.1~cloud0 amd64 Python libraries for the Ceph libcephfs library

root@tckvmhv01:/var/log/libvirt/qemu# dpkg -l | grep libvirt
ii libvirt-bin 1.3.1-1ubuntu10.1~cloud0 amd64 programs for the libvirt library
ii libvirt0:amd64 1.3.1-1ubuntu10.1~cloud0 amd64 library for interfacing with different virtualization systems
ii python-libvirt 1.3.1-1ubuntu1~cloud0 amd64 libvirt Python bindings

root@tckvmhv01:/var/log/libvirt/qemu# uname -a
Linux tckvmhv01 3.13.0-86-generic #131-Ubuntu SMP Thu May 12 23:33:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Tags: ceph libvirtd rbd
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

We have had this on two VMs in the last 2 days; they run Ubuntu 12.04 and 14.04 via virtio-scsi with 8 queues. Around the time they were killed, unattended-upgrades was also running on the host.

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

We had this again last night; it hit the 14.04 machine again. This bug seems to have a preference for that particular VM...

librbd/ExclusiveLock.cc: In function 'std::__cxx11::string librbd::ExclusiveLock<ImageCtxT>::encode_lock_cookie() const [with ImageCtxT = librbd::ImageCtx; std::__cxx11::string = std::__cxx11::basic_string<char>]' thread 7efe88abfc00 time 2016-08-03 22:40:44.069765
librbd/ExclusiveLock.cc: 217: FAILED assert(m_watch_handle != 0)
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x28d5d0) [0x7efe703335d0]
 2: (()+0x8c785) [0x7efe70132785]
 3: (()+0x8deab) [0x7efe70133eab]
 4: (()+0x8d6c7) [0x7efe701336c7]
 5: (()+0x74167) [0x7efe7011a167]
 6: (()+0x7456d) [0x7efe7011a56d]
 7: (rbd_aio_write()+0x4c) [0x7efe701067fc]
 8: (()+0x2359) [0x7efe7999c359]
 9: (()+0x241b) [0x7efe7999c41b]
 10: (()+0x495ada) [0x564d75434ada]
 11: (()+0x49809d) [0x564d7543709d]
 12: (()+0x498a87) [0x564d75437a87]
 13: (()+0x4633b6) [0x564d754023b6]
 14: (()+0x49809d) [0x564d7543709d]
 15: (()+0x498a87) [0x564d75437a87]
 16: (()+0x499730) [0x564d75438730]
 17: (()+0x4dd5ca) [0x564d7547c5ca]
 18: (()+0x4a590) [0x7efe83724590]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Might this correlate with creating snapshots or reading from them? Four seconds prior to this error, we took a snapshot of this block device and created an rbd diff --whole-diff between the old snapshot and the new snapshot...

Revision history for this message
Jason Dillaman (jdillaman) wrote :

@Luis or Daniel: Are you seeing an "image watch failed" error message in your QEMU librbd logs?

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

@jdillaman nope, there's no such entry.

Revision history for this message
Luis (luis-ramirez) wrote : Re: [Bug 1607694] Re: Exclusive-Lock Issue

No, we don't have any message like that.

Revision history for this message
Jason Dillaman (jdillaman) wrote :

Hmm -- any chance either one of you could reproduce with debug logging enabled? In your ceph configuration file, add the following to the "[client]" section:

debug rbd = 20
log path = /path/that/is/writable/by/qemu/process/ceph-$pid.log

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

Sure - but that's a bit heavy because these are production machines ;)
That'll take some time for a maintenance window.
Would this setting be active as soon as I `virsh restart` the machine?

Revision history for this message
Jason Dillaman (jdillaman) wrote :

As an alternative, could you provide the core dump? You can use ceph-post-file to upload the file.

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

I have added

[client]
debug rbd = 20
log path = /tmp/ceph-$pid.log

to /etc/ceph/ceph.conf
but this seems to have no effect on logging. At least, I can't find anything relevant under /tmp.

We're using virsh/libvirt for KVM virtualization in a hybrid configuration (OSD hosts == VM hosts). Any hints?

Revision history for this message
Jason Dillaman (jdillaman) wrote :

I think the config option is "log file" instead of "log path". Also, make sure your QEMU process has access to the directory (both permissions and AppArmor).
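
For reference, the corrected snippet would then look roughly like this (using the same /tmp path Daniel tried; note that librbd only reads ceph.conf when the client process starts, so a running QEMU process has to be restarted -- e.g. via a full shutdown/start of the domain -- before the setting takes effect):

[client]
debug rbd = 20
log file = /tmp/ceph-$pid.log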

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

OK, it's logging now for a machine that died 3 times in the last 3 nights.
All deaths directly followed a backup run, where we do this (a sketch of the corresponding commands follows below):

1. create a snapshot
2. create an rbd fast-diff between this snapshot and the one from 24 hours earlier

The log file is growing *very* fast, so let's see if we can afford to wait until tomorrow night.
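
For reference, that nightly sequence corresponds roughly to commands like the following (pool, image, and snapshot names are placeholders, and --whole-object is assumed to be the flag behind the "whole-diff" mentioned earlier):

rbd snap create <pool>/<image>@snap-today
rbd diff --from-snap snap-yesterday --whole-object <pool>/<image>@snap-today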

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

Got it. This is the log output leading up to the dead KVM process.

2016-08-12 22:29:44.499285 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db400b020 C_NotifyAck start: id=234071422725522, handle=140096867143248
2016-08-12 22:29:44.518673 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db4003b60 image header updated
2016-08-12 22:29:44.518680 7f6dcffff700 20 librbd::ImageState: refresh_seq = 5, last_refresh = 4
2016-08-12 22:29:44.518689 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db400b020 C_NotifyAck finish: id=234071422725522, handle=140096867143248
2016-08-12 22:29:44.573373 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db400b020 C_NotifyAck start: id=234071422725523, handle=140096867143248
2016-08-12 22:29:44.573386 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db4003b60 image header updated
2016-08-12 22:29:44.573389 7f6dcffff700 20 librbd::ImageState: refresh_seq = 6, last_refresh = 4
2016-08-12 22:29:44.573393 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db400b020 C_NotifyAck finish: id=234071422725523, handle=140096867143248
2016-08-12 22:29:45.249895 7f6dcffff700 -1 librbd::ImageWatcher: 0x7f6db4003b60 image watch failed: 140096867143248, (107) Transport endpoint is not connected
2016-08-12 22:29:45.262607 7f6dcffff700 10 librbd::ImageWatcher: 0x7f6db4003b60 current lock owner: [0,0]
2016-08-12 22:29:45.319110 7f6dcd6f9700 10 librbd::ImageWatcher: 0x7f6db4003b60 re-registering image watch
2016-08-12 22:29:45.319145 7f6dcd6f9700 10 librbd::ExclusiveLock: 0x7f6db40029f0 release_lock
2016-08-12 22:29:45.319150 7f6dcd6f9700 10 librbd::ExclusiveLock: 0x7f6db40029f0 send_release_lock
2016-08-12 22:29:45.319251 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: send_cancel_op_requests
2016-08-12 22:29:45.319263 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: handle_cancel_op_requests: r=0
2016-08-12 22:29:45.319267 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: send_block_writes
2016-08-12 22:29:45.319271 7f6dcf7fe700 5 librbd::AioImageRequestWQ: block_writes: 0x55ee44904f50, num=1
2016-08-12 22:29:45.319285 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: handle_block_writes: r=0
2016-08-12 22:29:45.319352 7f6dcf7fe700 10 librbd::ExclusiveLock: 0x7f6db40029f0 handle_releasing_lock
2016-08-12 22:29:45.319361 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: send_flush_notifies
2016-08-12 22:29:45.319366 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: handle_flush_notifies
2016-08-12 22:29:45.319368 7f6dcf7fe700 10 librbd::exclusive_lock::ReleaseRequest: send_close_object_map
2016-08-12 22:29:45.319378 7f6dcf7fe700 10 librbd::object_map::UnlockRequest: 0x7f6dc407b8c0 send_unlock: oid=rbd_object_map.40c5612ae8944a
2016-08-12 22:29:45.320613 7f6dcffff700 10 librbd::object_map::UnlockRequest: 0x7f6dc407b8c0 handle_unlock: r=-108
2016-08-12 22:29:45.320626 7f6dcffff700 -1 librbd::object_map::UnlockRequest: failed to release object map lock: (108) Cannot send after transport endpoint shutdown
2016-08-12 22:29:45.320639 7f6dcffff700 10 librbd::exclusive_lock::ReleaseRequest: handle_close_object_map: r=0
2016-08-12 22:29:45.320647 7f6dcffff700 10 librbd::exclusive_lock::ReleaseRequest: send_unlock
2016-08-12 22:29:45....


Revision history for this message
Jason Dillaman (jdillaman) wrote :

Thanks for the assistance. It appears the issue you are hitting is due to a failed watch:

2016-08-12 22:29:45.249895 7f6dcffff700 -1 librbd::ImageWatcher: 0x7f6db4003b60 image watch failed: 140096867143248, (107) Transport endpoint is not connected

There is a heartbeat that your client is supposed to send to the OSD every 5 seconds to prevent the watch from timing out after 30 seconds; a failure like this is indicative of an overloaded client or cluster.

The good news is that the fix is already available [1] and should be included in the next Ceph point release. This won't prevent the watch failure but should prevent the race condition between failure and recovery.

[1] http://tracker.ceph.com/issues/16923
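
As a side note, whether a client currently holds a watch on an image can be checked by listing the watchers on its header object; for a format 2 image that is roughly (image id looked up via rbd info; names are placeholders):

rbd info <pool>/<image> | grep block_name_prefix   # shows rbd_data.<id>
rados -p <pool> listwatchers rbd_header.<id>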

Revision history for this message
Jason Dillaman (jdillaman) wrote :

I should also mention that, in the meantime, you may want to disable the exclusive-lock feature on the affected images via the rbd CLI:

rbd feature disable <image-spec> exclusive-lock,object-map,fast-diff
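
A sketch of the reverse operation, in case the features are wanted back after upgrading to a fixed release: they have to be re-enabled in dependency order (object-map needs exclusive-lock, fast-diff needs object-map), and the object map should be rebuilt afterwards:

rbd feature enable <image-spec> exclusive-lock
rbd feature enable <image-spec> object-map
rbd feature enable <image-spec> fast-diff
rbd object-map rebuild <image-spec>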

Revision history for this message
Philipp Noack (philipp-noack-b) wrote :

Thanks for the possible workaround. But fast-diff is a very important feature for us, and unfortunately disabling it is not an option. The virtual machine gets killed almost every night at the moment. Is there an estimated time frame for the fix?

Revision history for this message
Jason Dillaman (jdillaman) wrote :

If you are interested, I can build a Jewel 10.2.2 + this fix "development" release package [1] that you can install and verify that it resolves the issue. I am going out of town for a few days so if it's desired, I would need to know ASAP in order to start the build.

[1] http://docs.ceph.com/docs/master/install/get-packages/#add-ceph-development

Revision history for this message
Philipp Noack (philipp-noack-b) wrote :

@jdillaman That would be awesome, I'd test this version on our servers to see if it resolves our problems.

Revision history for this message
Jason Dillaman (jdillaman) wrote :

The v10.2.2 branch with the cherry-picked fix is available under the 'wip-16950-jewel' branch name. See the instructions above for how to install test/development packages.
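
For anyone following along, installing such a test build roughly amounts to adding the development repository for the 'wip-16950-jewel' branch as described in the linked instructions and then upgrading the client-side packages, e.g.:

sudo apt-get update
sudo apt-get install --only-upgrade librados2 librbd1 ceph-common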

Revision history for this message
Christian Theune (ctheune) wrote :

Now that 10.2.3 has been released - any news on whether this is fixed? Daniel?

Revision history for this message
Daniel Kraft (daniel-kraft) wrote :

Until Monday I'd have said: yes, fixed. We have had 10.2.3 running for a few weeks now and it looks good.

However, on Monday we had 2 VMs die in one night with the error message 'INFO: task jbd2/sda1-8:171 blocked for more than 120 seconds.'
This was during snapshot create/delete times.

Also, we had one VM where virsh/qemu must have lost the RBD/Ceph connection. It is hard to say what happened exactly, but the running VM logged to stdout that sda was gone, and after a hard reset the VM had no disk to boot from. A shutdown/start cycle resolved that.

Both errors look like different problems right now, which is why I wasn't reporting anything here. So from that perspective I'd say:

"solved".

Revision history for this message
Christian Theune (ctheune) wrote :

Thanks. An interesting note from my side: I'm currently debugging a different hang that we're experiencing with XFS inside the guest. See http://tracker.ceph.com/issues/17536, which will be updated with detailed logs soon.

Revision history for this message
James Page (james-page) wrote :

Included in the 10.2.3 release, which is currently in xenial-proposed for testing, so marking Fix Committed.

Changed in ceph (Ubuntu):
status: Confirmed → Fix Committed
importance: Undecided → Medium
Revision history for this message
James Page (james-page) wrote :

10.2.3 has now been released to updates - marking 'Fix Released'.

Changed in ceph (Ubuntu):
status: Fix Committed → Fix Released