Comment 7 for bug 1706083

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/488959
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1797d73efc0601a0a664d32a127669e93bce3d45
Submitter: Jenkins
Branch: stable/newton

commit 1797d73efc0601a0a664d32a127669e93bce3d45
Author: Kashyap Chamarthy <email address hidden>
Date: Thu Jul 20 19:01:23 2017 +0200

    libvirt: Post-migration, set cache value for Cinder volume(s)

    This was noticed in a downstream bug: when a Nova instance with a
    Cinder volume (in this case, both the Nova instance storage _and_ the
    Cinder volume are located on Ceph) is migrated to a target Compute
    node, the disk cache value for the Cinder volume changes. That is, the
    QEMU command-line for the Cinder volume stored on Ceph changes as
    follows:

    Pre-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=writeback

    Post-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=none

    Furthermore, Jason Dillaman from Ceph confirms RBD cache being enabled
    pre-migration:

        $ ceph --admin-daemon /var/run/qemu/ceph-client.openstack.[...] \
            config get rbd_cache
        {
            "rbd_cache": "true"
        }

    And disabled, post-migration:

        $ ceph --admin-daemon /var/run/qemu/ceph-client.openstack.[...] \
            config get rbd_cache
        {
            "rbd_cache": "false"
        }

    This change in cache value post-migration increases I/O latency on the
    Cinder volume.

    From a chat with Daniel Berrangé on IRC: Prior to live migration, Nova
    rewrites all the <disk> elements and passes the updated guest XML
    across to the target libvirt. It never calls _set_cache_mode() when
    doing this, so the `writeback` setting from `nova.conf` is lost,
    leaving the default `cache=none` setting. This mistake (of leaving
    the cache value at the default 'none') will of course be corrected
    when you reboot the guest on the target later.

    So:

      - Call _set_cache_mode() in the _get_volume_config() method, because
        that method is passed as a callback to _update_volume_xml() in
        nova/virt/libvirt/migration.py.

      - And remove duplicate calls to _set_cache_mode() in
        _get_guest_storage_config() and attach_volume().

      - Fix broken unit tests; adjust test_get_volume_config() to reflect
        the disk cache mode.

    Thanks: Jason Dillaman of Ceph for observing the change in cache modes
            in a downstream bug analysis, Daniel Berrangé for help in
            analysis from a Nova libvirt driver POV, and Stefan Hajnoczi
            from QEMU for help on I/O latency instrumentation with `perf`.

    Conflicts [stable/newton]:
     - libvirt/driver.py: The _get_scsi_controller() method from Git master
       isn't in Newton, so adjust the _get_guest_storage_config() method
       accordingly.
     - Fix unit test conflicts in the method
       test_attach_volume_with_vir_domain_affect_live_flag().

    Closes-bug: 1706083
    Change-Id: I4184382b49dd2193d6a21bfe02ea973d02d8b09f
    (cherry picked from commit 14c38ac0f253036da79f9d07aedf7dfd5778fde8)
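
The idea behind the fix in the commit message above can be sketched in a few lines of Python. This is a simplified model, not the actual Nova code: the class names `DiskConfig` and `SketchDriver`, the `connection_info` dictionary layout, and the two `get_volume_config_*` methods are hypothetical and exist only for illustration. The sketch assumes the configured cache modes are available as a mapping from disk source type to cache mode, and that a cache value of `None` means "nothing set, so the hypervisor default applies".

```python
class DiskConfig:
    """Hypothetical stand-in for a guest disk configuration object."""

    def __init__(self, source_type, driver_cache=None):
        self.source_type = source_type
        # None means no cache mode was set; the default then applies.
        self.driver_cache = driver_cache


class SketchDriver:
    """Hypothetical, heavily simplified model of the driver logic."""

    def __init__(self, disk_cachemodes):
        # Assumed shape: {"network": "writeback"}, derived from configuration.
        self.disk_cachemodes = disk_cachemodes

    def _set_cache_mode(self, conf):
        # Apply the configured cache mode for this disk's source type, if any.
        mode = self.disk_cachemodes.get(conf.source_type)
        if mode is not None:
            conf.driver_cache = mode

    def _build_volume_config(self, connection_info):
        # Build the disk config without touching the cache mode.
        return DiskConfig(source_type=connection_info["source_type"])

    def get_volume_config_before_fix(self, connection_info):
        # Models the bug: callers such as the migration XML update receive
        # a config whose cache mode was never set.
        return self._build_volume_config(connection_info)

    def get_volume_config_after_fix(self, connection_info):
        # Models the fix: the cache mode is set in the one place every
        # caller goes through, so no code path can skip it.
        conf = self._build_volume_config(connection_info)
        self._set_cache_mode(conf)
        return conf


if __name__ == "__main__":
    driver = SketchDriver({"network": "writeback"})
    info = {"source_type": "network"}
    print("before fix:", driver.get_volume_config_before_fix(info).driver_cache)
    print("after fix: ", driver.get_volume_config_after_fix(info).driver_cache)
```

Setting the cache mode inside the shared config-building method, rather than in each caller, is also what makes the separate calls in the attach and storage-config paths redundant, which is why the commit removes them.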