Charm option to enable debugging of the client side of ceph in libvirt / nova

Bug #1961839 reported by Fabio Augusto Miranda Martins
This bug affects 1 person
Affects: OpenStack Nova Compute Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

When troubleshooting nova-compute, it is sometimes useful to enable extra rbd debug logging. The process is documented in the "Tip" box at https://docs.ceph.com/en/latest/rbd/libvirt/#configuring-ceph, which suggests adding the following section to /etc/ceph/ceph.conf:

[client.libvirt]
log file = /var/log/ceph/qemu-guest-$pid.log
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

Overall, this is pretty straightforward. However, it is trickier in a production environment, where the charm maintains the ceph.conf file and apparmor also restricts where the libvirt/qemu process is able to write.

As an example, here are some details of what I had to do in order to make this work in a Lab environment:

1. Add the following entries under the [global] section (although they could go into the [client] section instead, per the ceph website):

admin socket = /tmp/$name.$pid.asok
log file = /tmp/qemu-guest-$pid.log

At this point, apparmor repeatedly prevented the files from being created under /tmp. I fixed it as follows:

2. Change /etc/apparmor.d/abstractions/libvirt-qemu and include:

  /tmp/ rw,
  /tmp/* rw,
  /etc/ceph/ r,
  /etc/ceph/ceph.client.nova-compute.keyring r,

Just FYI, I added mine under the following section:

  # Various functions will need to enumerate /tmp (e.g. ceph), allow the base
  # dir and a few known functions like samba support.
  # We want to avoid to give blanket rw permission to everything under /tmp,
  # users are expected to add site specific addons for more uncommon cases.
  # Qemu processes usually all run as the same users, so the "owner"
  # restriction prevents access to other services files, but not across
  # different instances.
  # This is a tradeoff between usability and security - if paths would be more
  # predictable that would be preferred - at least for write rules we would
  # want more unique paths per rule.
  /{,var/}tmp/ r,
  owner /{,var/}tmp/**/ r,
  /tmp/ rw,
  /tmp/* rw,
  /etc/ceph/ r,
  /etc/ceph/ceph.client.nova-compute.keyring r,

Note: initially I had only added /tmp/ and /tmp/*, which allowed libvirt to create the log and asok files under /tmp. However, the only message in my log file was "auth: unable to find a keyring on /etc/ceph/ceph.client.nova-compute.keyring: (13) Permission denied". That is why I also added the entries for /etc/ceph/ and /etc/ceph/ceph.client.nova-compute.keyring.
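
As a rough illustration of how a charm might generate these rules, here is a minimal Python sketch. The function name and its parameters are hypothetical, not part of the charm. It also narrows the /tmp rules to the log and asok file names used in step 1 instead of the blanket /tmp/* rw I used; that narrowing is an untested assumption, not something I verified in the lab.

```python
def rbd_debug_apparmor_rules(
    log_dir="/tmp",
    keyring="/etc/ceph/ceph.client.nova-compute.keyring",
):
    """Return apparmor rule lines for the rbd debug log, admin socket and keyring.

    Hypothetical helper: the globs below are narrower than the /tmp/* rule
    used in the lab test and have not been verified against a real profile.
    """
    log_dir = log_dir.rstrip("/")
    return [
        f"{log_dir}/ r,",                   # enumerate the directory
        f"{log_dir}/qemu-guest-*.log rw,",  # log file = .../qemu-guest-$pid.log
        f"{log_dir}/*.asok rw,",            # admin socket = .../$name.$pid.asok
        "/etc/ceph/ r,",                    # needed to locate the keyring
        f"{keyring} r,",                    # avoids the (13) Permission denied error
    ]


if __name__ == "__main__":
    # Print the rules indented the same way as the abstraction file.
    for rule in rbd_debug_apparmor_rules():
        print("  " + rule)
```

The returned lines would still have to be written into /etc/apparmor.d/abstractions/libvirt-qemu (or a local override) by whatever mechanism the charm already uses for apparmor.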

3. After that, I just needed to "openstack server stop" and then "openstack server start" the VM, and that created the asok and log files. This is required because stopping and starting through the openstack commands actually recreates (undefines/defines) the VM; the UUID will change, and so will the /etc/apparmor.d/libvirt/libvirt-9257e73d-aaff-4ec4-8f61-096ec1dc5569* files.

Note: Although the files were created, nothing was being logged in the /tmp/qemu-guest-$pid.log file. I then changed the ceph.conf file to add some debug settings, so I could check whether this was actually working (and it was):

admin socket = /tmp/$name.$pid.asok
log file = /tmp/qemu-guest-$pid.log
debug rbd = 20
debug rbd mirror = 20
debug rbd replay = 20
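
For illustration only, a charm could render these lines from a few options rather than having them edited by hand. The sketch below is a minimal Python example; the function name, its parameters and the default section name are assumptions, not existing charm code.

```python
def render_rbd_debug_section(section="client", log_dir="/tmp", debug_levels=None):
    """Render a ceph.conf section with an admin socket, log file and debug levels.

    Hypothetical helper. $name and $pid are left as-is because ceph expands
    those metavariables itself when the client starts.
    """
    if debug_levels is None:
        # Same subsystems and levels as in the lab test above.
        debug_levels = {"rbd": 20, "rbd mirror": 20, "rbd replay": 20}
    log_dir = log_dir.rstrip("/")
    lines = [
        f"[{section}]",
        f"admin socket = {log_dir}/$name.$pid.asok",
        f"log file = {log_dir}/qemu-guest-$pid.log",
    ]
    for subsystem, level in debug_levels.items():
        lines.append(f"debug {subsystem} = {level}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_rbd_debug_section(), end="")
```

Passing section="global" would match where I placed the entries in step 1.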

After stopping and starting the VM again, I could see the debug entries:

root@juju-ee1a70-openstack-13:/etc/apparmor.d# tail -f /tmp/qemu-guest-10374.log
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::ImageRequestWQ: 0x55f324bc4610 unblock_overlapping_io: ictx=0x55f324db39b0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::ImageRequestWQ: 0x55f324bc4610 remove_in_flight_write_ios: ictx=0x55f324db39b0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::ImageRequestWQ: 0x55f324bc4610 unblock_flushes: ictx=0x55f324db39b0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::ObjectDispatcher: 0x55f324cb7d80 send: object_dispatch_spec=0x7f1628012fa0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::SimpleSchedulerObjectDispatch: 0x7f1620013de0 flush:
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::SimpleSchedulerObjectDispatch: 0x7f1620013de0 dispatch_all_delayed_requests:
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::ObjectDispatcher: 0x55f324cb7d80 send: object_dispatch_spec=0x7f1628012fa0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::AioCompletion: 0x55f32560c910 complete_request: cb=1, pending=0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::AioCompletion: 0x55f32560c910 finalize: r=0
2022-02-22T17:24:27.396+0000 7f163cff9700 20 librbd::io::AsyncOperation: 0x55f32560ca08 finish_op

And the admin socket also works:

root@juju-ee1a70-openstack-13:/etc/apparmor.d# ceph --admin-daemon /tmp/client.nova-compute.10374.asok config show | grep debug_rbd
    "debug_rbd": "20/20",
    "debug_rbd_mirror": "20/20",
    "debug_rbd_replay": "20/20",
    "debug_rbd_rwl": "0/5",

It would be great if the charm could handle this process, to make debugging and troubleshooting easier.
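
The admin socket check shown earlier could also be scripted. The minimal Python sketch below assumes that "config show" prints a JSON object (as its output above suggests) and keeps only the debug_rbd* keys; the function name is hypothetical.

```python
import json


def rbd_debug_settings(config_show_output):
    """Return only the debug_rbd* options from 'config show' JSON output.

    Hypothetical helper: the caller is expected to pass in the text printed by
    `ceph --admin-daemon <socket> config show`.
    """
    config = json.loads(config_show_output)
    return {key: value for key, value in config.items() if key.startswith("debug_rbd")}


if __name__ == "__main__":
    # Small hand-written sample in the same shape as the output above.
    sample = '{"debug_rbd": "20/20", "debug_rbd_mirror": "20/20", "debug_ms": "0/0"}'
    print(rbd_debug_settings(sample))
```

In practice the input text would come from running the ceph command above, for example through subprocess.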

Billy Olsen (billy-olsen) wrote :

This would be a very good addition to have for debugging scenarios involving the librbd clients.

Changed in charm-nova-compute:
status: New → Triaged
importance: Undecided → Wishlist
Billy Olsen (billy-olsen) wrote :

I wonder if this should be an action. It's unlikely you'll want to enable this across a fleet of compute nodes at the same time, since enabling it has both performance and disk usage impacts.

Maybe something like the following:

juju run-action nova-compute enable-rbd-debug debug_rbd=20/20 ...

juju run-action nova-compute disable-rbd-debug

Of course, the challenge is how to handle this without a guest restart. The logging levels could be changed dynamically via the admin socket. The implementation will also need to consider log rotation, in order to avoid running out of disk space when capturing debug data. You may end up with less debug output than intended, but it's best not to take the host down with it either.
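
As a rough sketch of the "change levels via the admin socket" idea, the Python below builds one "config set" command per socket and option, using the same "ceph --admin-daemon" form shown in the bug description. The function name and the socket glob in the usage note are assumptions; this only works for guests that already have an admin socket, so it does not remove the need for the initial restart.

```python
def build_debug_commands(socket_paths, levels):
    """Return one argv list per (socket, option) pair.

    Hypothetical helper. `levels` maps option names such as "debug_rbd" to
    values such as "20/20". Nothing is executed here; the caller decides
    how and when to run the commands.
    """
    commands = []
    for sock in socket_paths:
        for option, value in levels.items():
            commands.append(
                ["ceph", "--admin-daemon", sock, "config", "set", option, value]
            )
    return commands


if __name__ == "__main__":
    example = build_debug_commands(
        ["/tmp/client.nova-compute.10374.asok"], {"debug_rbd": "20/20"}
    )
    for argv in example:
        print(" ".join(argv))
```

An action could collect the sockets with something like glob.glob("/tmp/client.nova-compute.*.asok") and pass each argv list to subprocess.run. Disabling would run the same commands with the original levels.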
