tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume failing when using the Q35 machine type

Bug #1832248 reported by Lee Yarwood
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Alexandre arents
Milestone: (none listed)

Bug Description

Description
===========

tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume is failing when using the Q35 machine type as configured as part of the following DNM test change:

DNM: Run tempest-full-py3 with q35 machine type
https://review.opendev.org/#/c/662887/

libvirtd appears to be receiving a DEVICE_DELETED event from QEMU just after the SCSI rescan and well before we attempt to block resize the disk within the domain:

Instance UUID: b3b8394a-866c-441f-b792-14d3b7da464c
Domain name: instance-00000055

http://logs.openstack.org/87/662887/19/check/tempest-full-py3/e0caad1/controller/logs/screen-n-cpu.txt.gz?#_Jun_09_14_51_08_780983

http://logs.openstack.org/87/662887/19/check/tempest-full-py3/e0caad1/controller/logs/libvirt/libvirtd_log.txt.gz#_2019-06-09_14_51_08_215

http://logs.openstack.org/87/662887/19/check/tempest-full-py3/e0caad1/controller/logs/screen-n-cpu.txt.gz?#_Jun_09_14_51_20_840546

We currently end up waiting 12 seconds here because os-brick attempts to find an mpath device even when use_multipath=False. I've created the following bug for this issue and proposed a change:

find_multipath_device_path being called needlessly by linuxscsi.extend_volume
https://bugs.launchpad.net/os-brick/+bug/1832247

FWIW, the proposed os-brick change works around the issue locally for me: the block resize is then requested before QEMU has a chance to raise the DEVICE_DELETED notification to libvirtd.
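
To make the ordering concrete, here is a minimal Python sketch of the flow described above. It is not os-brick's code: `extend_volume_sketch`, `rescan`, `find_mpath` and the `guarded` flag are hypothetical names, and the guard is only assumed to resemble the proposed change, where the multipath lookup is skipped unless use_multipath is set.

```python
def extend_volume_sketch(use_multipath, rescan, find_mpath, guarded=True):
    """Hypothetical, simplified model of the extend flow; not os-brick's real code."""
    rescan()  # SCSI rescan so the host sees the new volume size
    if use_multipath or not guarded:
        # Without the guard this lookup runs even when multipath is off and,
        # per the report, delays the caller by roughly 12 seconds.
        find_mpath()


def run(use_multipath, guarded):
    """Record the order of the steps for one simulated extend."""
    calls = []
    extend_volume_sketch(
        use_multipath,
        rescan=lambda: calls.append("rescan"),
        find_mpath=lambda: calls.append("find_mpath"),
        guarded=guarded,
    )
    # The caller only requests the block resize after the function returns.
    calls.append("block_resize")
    return calls


print(run(use_multipath=False, guarded=False))
print(run(use_multipath=False, guarded=True))
```

With the guard in place and multipath disabled, the resize request follows the rescan directly, which narrows the window in which the DEVICE_DELETED event can arrive first. It does not remove the underlying race.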

Steps to reproduce
==================
* Use the Q35 machine type

  [libvirt]\hw_machine_type = x86_64=q35

* Run the test_extend_attached_volume test

  $ tempest run --regex tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume
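
The `[libvirt]\hw_machine_type` notation above is shorthand for the `hw_machine_type` option in the `[libvirt]` section of the nova configuration. A small Python sketch for checking that setting before running the test follows. The `NOVA_CONF` text and the `machine_types` helper are illustrative assumptions, not part of nova; the value is treated as a comma-separated list of arch=machine_type pairs.

```python
import configparser

# Assumed contents of the compute node's nova configuration (illustrative only).
NOVA_CONF = """
[libvirt]
hw_machine_type = x86_64=q35
"""


def machine_types(conf_text):
    """Return a dict mapping architecture to machine type from [libvirt]/hw_machine_type."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("libvirt", "hw_machine_type", fallback="")
    # configparser splits on the first "=", so the value here is "x86_64=q35".
    pairs = (item.split("=", 1) for item in raw.split(",") if item.strip())
    return {arch.strip(): mtype.strip() for arch, mtype in pairs}


print(machine_types(NOVA_CONF))
```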

Expected result
===============
Test passes.

Actual result
=============
Test fails.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   1316c1c2850d2f966f335b628f7f5fe88cef611c

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM
   qemu-system-x86 1:2.11+dfsg-1ubuntu7.14
   libvirt0:amd64 4.0.0-1ubuntu8.10

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   LVM/iSCSI

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

2019-06-09 14:51:08.215+0000: 22679: debug : qemuMonitorJSONIOProcessLine:193 : Line [{"timestamp": {"seconds": 1560091868, "microseconds": 215572}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}]

Jun 09 14:51:20.840546 ubuntu-bionic-rax-dfw-0007351839 nova-compute[18218]: ERROR nova.virt.libvirt.driver [req-81eae1ea-9bab-470b-8436-0c66701368b4 req-3be591ea-bcb2-44a6-bb9d-85adae6ca3c0 service nova] [instance: b3b8394a-866c-441f-b792-14d3b7da464c] resizing block device failed.: libvirt.libvirtError: invalid argument: invalid path: /dev/sda
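
The gap between the two log lines above can be computed with a short Python sketch. The QMP payload is copied from the libvirtd line. The nova-compute timestamp has no year or timezone in the log, so 2019 and UTC are assumptions taken from the libvirtd line; `qmp_event_time` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# QMP event payload copied from the libvirtd debug line above.
QMP_LINE = (
    '{"timestamp": {"seconds": 1560091868, "microseconds": 215572}, '
    '"event": "DEVICE_DELETED", '
    '"data": {"path": "/machine/peripheral/virtio-disk1/virtio-backend"}}'
)

# Timestamp of the nova-compute ERROR line above.
# Assumption: year 2019 and UTC, matching the libvirtd log.
NOVA_ERROR_TIME = datetime(2019, 6, 9, 14, 51, 20, 840546, tzinfo=timezone.utc)


def qmp_event_time(line):
    """Return (event name, device path, UTC datetime) for one QMP event line."""
    event = json.loads(line)
    stamp = event["timestamp"]
    when = datetime.fromtimestamp(stamp["seconds"], tz=timezone.utc)
    when = when.replace(microsecond=stamp["microseconds"])
    return event["event"], event["data"]["path"], when


name, path, deleted_at = qmp_event_time(QMP_LINE)
gap = (NOVA_ERROR_TIME - deleted_at).total_seconds()
print(f"{name} for {path} at {deleted_at.isoformat()}")
print(f"block resize failure logged {gap:.3f}s later")
```

Under those assumptions the failure is logged about 12.6 seconds after the DEVICE_DELETED event, which lines up with the 12 second multipath wait mentioned in the description.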

Lee Yarwood (lyarwood)
summary: tempest.api.volume.test_volumes_extend.VolumesExtendAttachedTest.test_extend_attached_volume
- failing when usinug the Q35 machine type
+ failing when using the Q35 machine type
Lee Yarwood (lyarwood)
tags: added: libvirt
tags: added: qemu volumes
Matt Riedemann (mriedem) wrote :

Is the plan to just get https://review.opendev.org/#/c/664418/ released and bump nova's minimum required version of os-brick that it depends on and consider this nova bug fixed? Or are there nova changes to make as well?

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Lee Yarwood (lyarwood) wrote :

No, as discussed that's a separate bug that just happened to workaround what is an underlying QEMU / guest OS issue when using the q35 machine type. I'd like to leave this open as we might need to address this either in openstack/nova, openstack/tempest or both in the future if we ever default to q35.

Alexandre arents (aarents) wrote :

We see a similar issue in operation with q35/qemu-4.0/libvirt-5.4.0,
but for interface attachment; we are not able to reproduce it easily for now.
The behavior is the same: the attachment seems OK from libvirt/nova,
then libvirt receives a "DEVICE_DELETED" event from qemu.

This results in the guest's persistent config still containing the interface
(virsh dumpxml instance --inactive) while the active config no longer does.
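
A rough Python sketch of how that mismatch could be detected, given the two XML dumps as strings. The sample XML and MAC address are made up for illustration, and `interface_macs` and `missing_from_active` are hypothetical helper names; real input would come from `virsh dumpxml instance --inactive` and `virsh dumpxml instance`.

```python
import xml.etree.ElementTree as ET

# Illustrative domain XML only; not taken from an affected instance.
INACTIVE_XML = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='52:54:00:00:00:01'/>
    </interface>
  </devices>
</domain>
"""
ACTIVE_XML = """
<domain type='kvm'>
  <devices/>
</domain>
"""


def interface_macs(domain_xml):
    """Return the set of interface MAC addresses found in a domain XML string."""
    root = ET.fromstring(domain_xml)
    return {mac.get("address") for mac in root.findall("./devices/interface/mac")}


def missing_from_active(inactive_xml, active_xml):
    """MACs present in the persistent config but absent from the live one."""
    return interface_macs(inactive_xml) - interface_macs(active_xml)


print(missing_from_active(INACTIVE_XML, ACTIVE_XML))
```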

Was the tempest test always failing in an isolated devstack run (not in CI)?
I ask because I'm not able to reproduce the issue on master using ubuntu focal (qemu-4.2.3/libvirt-6.0)
after disabling the workaround fix with this change:
/usr/local/lib/python3.8/dist-packages/os_brick/initiator/linuxscsi.py:
602 #if use_multipath:
603 if True:

Alexandre arents (aarents) wrote :

I was able to reproduce this on a bionic devstack:
local.conf:TARGET_BRANCH=stable/ussuri
nova-cpu.conf:[libvirt]\hw_machine_type = x86_64=q35

And with the os-brick workaround patch commented out:
/usr/local/lib/python3.8/dist-packages/os_brick/initiator/linuxscsi.py:
602 #if use_multipath:
603 if True:

bionic comes with 3 qemu releases:
1:4.0+dfsg-0ubuntu9.8~cloud0
1:2.11+dfsg-1ubuntu7.32
1:2.11+dfsg-1ubuntu7
The test fails with all 3 releases.

While bisecting qemu from 4.0 to 4.2.1, the test starts to pass
with this commit:

2841ab435bca9f102311e01bf157d5fa878935dc is the first bad commit
commit 2841ab435bca9f102311e01bf157d5fa878935dc
Author: Michael S. Tsirkin <email address hidden>
Date: Fri Jun 21 00:12:22 2019 -0400

 pcie: check that slt ctrl changed before deleting

 During boot, linux would sometimes overwrites control of a powered off
 slot before powering it on. Unfortunately QEMU interprets that as a
 power off request and ejects the device.
....

~/qemu$ git branch -a --contains 2841ab435bca9f102311e01bf157d5fa878935dc
  master
  remotes/origin/master
  remotes/origin/stable-4.1
  remotes/origin/stable-4.2
  remotes/origin/stable-5.0

That's why it has been fixed since focal (qemu-4.2.3).

Changed in nova:
assignee: nobody → Alexandre arents (aarents)
Changed in nova:
status: Triaged → Invalid