Failed to attach volume - Connection to the hypervisor is broken on host

Bug #1880509 reported by alpha23
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Expired   Undecided   Unassigned
kolla-ansible              Invalid   Undecided   Unassigned

Bug Description

Using Rocky centos/source containers
Docker version - 18.06.1-ce
Hosts are Ubuntu 18.04.4 LTS

Receiving the following error from nova_compute, which only occurs when attempting to attach an existing, unattached volume to a VM.
- Instances are created on the same host without issue, and the virtual host is accessible via Virtual Machine Manager. For testing purposes, I was able to attach the same volume via VMM.
- All compute, block, and network services are enabled and up.
- 'nova hypervisor-list' shows the hypervisor up
- The same volume and other volumes attach to VMs on a different hypervisor without issue.
- The only difference I see between the hypervisor hostnames of the failing and working hosts is that the failing one shows an FQDN. Note that the log below specifies only the short hostname, not the FQDN.
- From nova_compute, both the hostname and the FQDN resolve correctly to the same IP, and I can ping either.
- virt_type is kvm
- compute_driver is libvirt.LibvirtDriver
- The kernel on both hosts (failing and not failing to attach) is 4.15.0-101-generic

Is this a settings issue, a permissions issue, a driver issue, or something else?

From nova-compute.log:

2020-05-24 23:52:54.692 7 DEBUG oslo_concurrency.lockutils [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] Lock "d51fddde-1b0c-4365-b1ec-ad508d2d0bed" released by "nova.compute.manager.do_reserve" :: held 0.154s inner /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2020-05-24 23:52:55.469 7 DEBUG oslo_concurrency.lockutils [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] Lock "d51fddde-1b0c-4365-b1ec-ad508d2d0bed" acquired by "nova.compute.manager.do_attach_volume" :: waited 0.000s inner /var/lib/kolla/venv/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2020-05-24 23:52:55.470 7 INFO nova.compute.manager [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] Attaching volume cadca51f-ff52-4b12-9777-4f86514e3eef to /dev/vdb
2020-05-24 23:52:55.471 7 DEBUG nova.objects.instance [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] Lazy-loading 'flavor' on Instance uuid d51fddde-1b0c-4365-b1ec-ad508d2d0bed obj_load_attr /var/lib/kolla/venv/lib/python2.7/site-packages/nova/objects/instance.py:1111
2020-05-24 23:52:55.738 7 DEBUG os_brick.utils [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] ==> get_connector_properties: call u"{'execute': None, 'my_ip': '192.168.3.252', 'enforce_multipath': True, 'host': 'st1', 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', 'multipath': False}" trace_logging_wrapper /var/lib/kolla/venv/lib/python2.7/site-packages/os_brick/utils.py:146
2020-05-24 23:52:56.428 7 DEBUG os_brick.initiator.linuxfc [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] No Fibre Channel support detected on system. get_fc_hbas /var/lib/kolla/venv/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
2020-05-24 23:52:56.429 7 DEBUG os_brick.initiator.linuxfc [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] No Fibre Channel support detected on system. get_fc_hbas /var/lib/kolla/venv/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
2020-05-24 23:52:56.430 7 DEBUG os_brick.utils [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] <== get_connector_properties: return (691ms) {'initiator': u'iqn.1994-05.com.redhat:cddc7297faec', 'ip': u'192.168.3.252', 'platform': u'x86_64', 'host': u'st1', 'do_local_attach': False, 'os_type': u'linux2', 'multipath': False} trace_logging_wrapper /var/lib/kolla/venv/lib/python2.7/site-packages/os_brick/utils.py:170
2020-05-24 23:52:56.432 7 DEBUG nova.virt.block_device [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] Updating existing volume attachment record: 9a69f304-8468-47f6-8b59-5edd1e3dd75f _volume_attach /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/block_device.py:535
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [req-94fdd34c-7799-4e1b-835e-78ee27236d71 240c5b223f324b5b84471039304fc530 96d55c41d2094f679c37078b034f552e - default default] [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] Driver failed to attach volume cadca51f-ff52-4b12-9777-4f86514e3eef at /dev/vdb: HypervisorUnavailable: Connection to the hypervisor is broken on host: st1
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] Traceback (most recent call last):
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/block_device.py", line 563, in _volume_attach
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] device_type=self['device_type'], encryption=encryption)
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1481, in attach_volume
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] encryption=encryption)
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1264, in _connect_volume
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] vol_driver.connect_volume(connection_info, instance)
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/fs.py", line 117, in connect_volume
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] self._mount_options(connection_info))
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/mount.py", line 409, in mount
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] with __manager__.get_state() as mount_state:
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] return self.gen.next()
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/mount.py", line 89, in get_state
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] raise exception.HypervisorUnavailable(host=CONF.host)
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed] HypervisorUnavailable: Connection to the hypervisor is broken on host: st1
2020-05-24 23:53:09.078 7 ERROR nova.virt.block_device [instance: d51fddde-1b0c-4365-b1ec-ad508d2d0bed]

Tags: rocky
alpha23 (alpha23)
description: updated
description: updated
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Does it only happen when attaching cinder volumes? (i.e. the host works normally otherwise?)

Revision history for this message
alpha23 (alpha23) wrote :

Yes, that is correct. The host works normally otherwise.

Note, I did have to do some cleanup on the Cinder database due to failed migrations during the Queens to Rocky upgrade, and while I'm not intimately familiar with the Cinder database, this does not appear to be the root cause of the issue because the volume attaches to a different host running on a different hypervisor host.

Also, Google searches returning, for example, "HypervisorUnavailable: Connection to the hypervisor is broken on host" were primarily related to libvirt being down. That is not the case here. I did not see anything that related to the issue in this report and also had a solution for it.

I also took the step of reconfiguring both Nova and Cinder, but the issue remains.

Finally, in Queens, volumes attached to all hypervisor hosts without issue.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I'm notifying nova - maybe they are aware of some issue that could cause this. To reiterate: this host works as long as a Cinder volume is not used? I.e. running an instance from local storage is no problem for it? (That is what I was getting at with my previous question.)

FWIW, it does not look like a db problem so far. ;-)

Changed in kolla-ansible:
status: New → Incomplete
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In the meantime, could you share the nova config files generated on that host? (as available in /etc/nova in nova_compute on that host)

Revision history for this message
alpha23 (alpha23) wrote :

The host works without issue, and running an instance from either nova or virt-manager can be done without issue. Additionally, I can attach volumes from local storage to the same VM via virt-manager. It is only the attachment of a Cinder volume to a Nova instance that exhibits the exception, and only on one of the hypervisor hosts.

Do you think it may be an issue with the hypervisor host name, e.g. the hostname alone is displayed in the logs but the FQDN is shown in 'hypervisor-list'?

The following is nova.conf. Variables in '<>' have been replaced to conceal their actual values. While '<PASSWORD>' is used below in multiple places, the passwords differ; '<CONTROLLER_IP>' is the same throughout, however, as there is only one controller.

[DEFAULT]
debug = True
log_dir = /var/log/kolla/nova
state_path = /var/lib/nova
osapi_compute_listen = <CONTROLLER_IP>
osapi_compute_listen_port = 8774
osapi_compute_workers = 5
metadata_workers = 5
metadata_listen = <CONTROLLER_IP>
metadata_listen_port = 8775
firewall_driver = nova.virt.firewall.NoopFirewallDriver
allow_resize_to_same_host = true
compute_driver = libvirt.LibvirtDriver
my_ip = <CONTROLLER_IP>
instance_usage_audit = True
instance_usage_audit_period = hour
compute_monitors = nova.compute.monitors.cpu.virt_driver
transport_url = rabbit://openstack:<PASSWORD>@<CONTROLLER_IP>:5672
use_cow_images = false
resume_guests_state_on_host_boot = true

[api]
use_forwarded_for = true

[conductor]
workers = 5

[vnc]
novncproxy_host = <CONTROLLER_IP>
novncproxy_port = 6080
server_listen = <CONTROLLER_IP>
server_proxyclient_address = <CONTROLLER_IP>
novncproxy_base_url = http://<CONTROLLER_IP>:6080/vnc_auto.html

[oslo_middleware]
enable_proxy_headers_parsing = True

[oslo_concurrency]
lock_path = /var/lib/nova/tmp

[glance]
api_servers = http://<CONTROLLER_IP>:9292
num_retries = 1
debug = False

[cinder]
catalog_info = volumev3:cinderv3:internalURL
os_region_name = RegionOne
auth_url = http://<CONTROLLER_IP>:35357
auth_type = password
project_domain_name = Default
user_domain_id = default
project_name = service
username = cinder
password = <PASSWORD>

[neutron]
metadata_proxy_shared_secret = <SHARED_SECRET>
service_metadata_proxy = true
auth_url = http://<CONTROLLER_IP>:35357
auth_type = password
project_domain_name = Default
user_domain_id = default
project_name = service
username = neutron
password = <PASSWORD>
region_name = RegionOne
valid_interfaces = internal

[cache]
backend = oslo_cache.memcache_pool
enabled = True
memcache_servers = <CONTROLLER_IP>:11211

[keystone_authtoken]
www_authenticate_uri = http://<CONTROLLER_IP>:5000
auth_url = http://<CONTROLLER_IP>:35357
auth_type = password
project_domain_id = default
user_domain_id = default
project_name = service
username = nova
password = <PASSWORD>
memcache_security_strategy = ENCRYPT
memcache_secret_key = <SECRET_KEY>
memcached_servers = <CONTROLLER_IP>:11211

[libvirt]
connection_uri = qemu+tcp://<CONTROLLER_IP>/system
virt_type = kvm

[upgrade_levels]
compute = auto

[oslo_messaging_notifications]
transport_url = rabbit://openstack:<PASSWORD>@<CONTROLLER_IP>:5672
driver = messagingv2
topics = notifications,notifications_designate

[privsep_entrypoint]
helper_command =...


Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Is it the only hypervisor that has this FQDN/hostname discrepancy? It should not really matter unless the code paths actually start caring when Cinder is used, but this is a matter for Nova.

I think you did too many CONTROLLER_IP replacements, i.e. some of these (notably my_ip) should have the local IP instead.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

As per [1], the [DEFAULT]/host config option is used for the cinder host attachment information. So I agree with Radoslaw that this could be a hostname configuration issue.

[1] https://docs.openstack.org/nova/rocky/configuration/config.html#DEFAULT.host

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Hmm, this mentions 'cinder host attachment information' as well, so it might be related. Still, it is weird that everything else works, i.e. I would expect the scheduler or networking to fail as well. Though it is always what the code thinks (and not the docs). :-)

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Could be a cinder - nova, FQDN - non-FQDN config mismatch but it is hard to tell.

Another angle to tackle this is by dissecting the stack trace:

So the error happens here [1]. The None condition can occur if there was a host_down without a subsequent host_up, or if there has been no host_up at all since __init__. Both host_up and host_down are called from the libvirt driver [2]. This code has debug log statements, so we might be able to understand the situation better from the logs.
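
In rough terms, the gating looks like this (a simplified, runnable Python sketch, not the actual nova source; the class and attribute names here are illustrative):

import contextlib


class HypervisorUnavailable(Exception):
    pass


class MountManager(object):
    """Illustrative stand-in for nova's mount manager state gating."""

    def __init__(self):
        # Set by host_up() once the host's guests have been scanned;
        # cleared again by host_down().
        self.state = None

    def host_up(self, host_state):
        # In nova this builds a _HostMountState by walking every guest's
        # disks; if that walk raises, self.state is never set.
        self.state = host_state

    def host_down(self):
        self.state = None

    @contextlib.contextmanager
    def get_state(self):
        # Every volume attach goes through here; with state still None the
        # attach fails with HypervisorUnavailable even though libvirt is up.
        if self.state is None:
            raise HypervisorUnavailable("Connection to the hypervisor is "
                                        "broken on host: st1")
        yield self.state


if __name__ == "__main__":
    manager = MountManager()   # no host_up() has succeeded yet
    try:
        with manager.get_state():
            pass
    except HypervisorUnavailable as exc:
        print("attach fails with: %s" % exc)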

@alpha23:
Do you have debug logs from the nova-compute service?
Can you reproduce the problem and collect such logs?
Does a nova-compute service restart resolve the problem?
Does a compute host reboot resolve the problem?
Does a
  $ openstack compute service set <host> <service> --disable
  $ openstack compute service set <host> <service> --enable
resolve the problem?

Setting the nova bug to Incomplete for now, but please set it back to New once you have provided the requested information.

[1] https://github.com/openstack/nova/blob/af3adcced959ab480b82e32cf22e953b97e66a58/nova/virt/libvirt/volume/mount.py#L89
[2] https://github.com/openstack/nova/blob/af3adcced959ab480b82e32cf22e953b97e66a58/nova/virt/libvirt/driver.py#L3841

Changed in nova:
status: New → Incomplete
Revision history for this message
alpha23 (alpha23) wrote :

Do you have debug logs from nova-compute service?
    The above log was the debug log; what additional information are you looking for?

Can you reproduce the problem and collect such logs?
     Very reproducible.

Does a nova-compute service restart resolve the problem?
     No.

Does a compute host reboot resolve the problem?
     No

Does a
  $ openstack compute service set <host> <service> --disable
  $ openstack compute service set <host> <service> --enable
resolve the problem?
     No.

alpha23 (alpha23)
Changed in nova:
status: Incomplete → New
Changed in kolla-ansible:
status: Incomplete → New
Revision history for this message
alpha23 (alpha23) wrote :

The issue may be related to the nova database. Curiously, the compute_nodes table produces:

+---------------+---------------------+--------------------------------------+---------------------+---------------------+
| host          | created_at          | uuid                                 | hypervisor_hostname | deleted_at          |
+---------------+---------------------+--------------------------------------+---------------------+---------------------+
| st1           | 2018-06-09 21:04:56 | 9c7056b8-21f4-49d2-833b-b446c9315974 | st1                 | 2018-10-13 16:48:47 |
| st1           | 2018-10-13 16:48:19 | 14fc8f68-92bd-48dd-8826-915d28fb4822 | st1.xyz.local       | 2020-05-26 21:04:00 |
| st1.xyz.local | 2020-05-26 04:36:36 | a3786abe-0389-46d5-ac08-2fce604548e2 | st1.xyz.local       | NULL                |
| st1           | 2020-05-26 21:04:23 | f840a1c8-432a-41f5-a7ba-e5e7d1d78c29 | st1                 | NULL                |
+---------------+---------------------+--------------------------------------+---------------------+---------------------+

ALL of the above refer to the SAME compute node (note the creation and deletion dates).

nova hypervisor list produces:

+--------------------------------------+---------------------+-------+----------+
| ID                                   | Hypervisor hostname | State | Status   |
+--------------------------------------+---------------------+-------+----------+
| a3786abe-0389-46d5-ac08-2fce604548e2 | st1.xyz.local       | down  | disabled |
| f840a1c8-432a-41f5-a7ba-e5e7d1d78c29 | st1                 | up    | enabled  |
+--------------------------------------+---------------------+-------+----------+

Both refer to the same compute node, and the latter was enabled after changing the /etc/hosts file on the docker host, which occurred on 5/26 @ 21:04. Both the hostname and the hypervisor hostname should be 'st1' now.

However, the nova instances table contains, for example, the following (I do not see the hypervisor UUID available as a column in this table):

select host,hostname,node from instances limit 1;
+------+-----------------------------------------------+------+
| host | hostname                                      | node |
+------+-----------------------------------------------+------+
| st1  | zeta-13-cluster-qypnojhoxz5q-primary-master-0 | st1  |
+------+-----------------------------------------------+------+

Is nova-compute attempting to attach a volume to 'st1' but selecting the incorrect host/UUID, causing it to report HypervisorUnavailable? And/or, if the hypervisor_hostname was/is an FQDN, did the migration from Queens to Rocky break functionality because of the mismatch between the short hostname and the FQDN (see the 2nd entry, which was the compute node record at the time this issue was reported)?

Regardless, the 3rd compute node needs to be deleted. How can this be done without affecting the instances?

Revision history for this message
alpha23 (alpha23) wrote :

This issue actually appears to be unrelated to where the exception is thrown. Instead it is related to nova_compute startup (relevant log below). The code loops through each guest drive and, if a drive is not connected (in my case, there was an unconnected CDROM that was unrelated to the instance under test), throws an exception and then prevents all cinder volume attachments. I did not go through and track down what nova-compute is actually doing after the fact; however, this was not an issue with Queens. Once the CDROM device was removed, the exception initially exhibited went away.

Why does the code not skip over drives that do not have a drive path, but instead throw an exception, never complete whatever initialization is required, and then resurface later with an exception that is unrelated to the root cause?

As a note, the code in Queens worked without issue.
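
A simplified, runnable Python sketch of the failing scan and the guard I would have expected (this is not the actual nova source; the Disk class and scan_mountpoints function are illustrative; the real traceback from my logs follows below):

import os


class Disk(object):
    """Illustrative stand-in for a guest disk as seen by the host_up() scan."""

    def __init__(self, source_path):
        self.source_path = source_path


def scan_mountpoints(disks):
    """Derive mountpoints from each disk's source path.

    The guard on source_path is the part I would have expected: skip
    devices with no backing path (e.g. an empty CDROM) instead of letting
    os.path.dirname(None) raise AttributeError ('NoneType' object has no
    attribute 'rfind') and abort the whole scan.
    """
    mountpoints = set()
    for disk in disks:
        if disk.source_path is None:
            continue
        mountpoints.add(os.path.dirname(disk.source_path))
    return mountpoints


if __name__ == "__main__":
    disks = [
        Disk("/var/lib/nova/mnt/f7af0a8752d0f8b2542c2254fb348896/vol"),
        Disk(None),   # unattached CDROM: this is what broke host_up() here
    ]
    print(scan_mountpoints(disks))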

2020-05-31 22:21:55.146 9 DEBUG nova.virt.libvirt.volume.mount [req-acf5eab2-20c2-477c-a902-dfa067ab3137 - - - - -] Discovered volume volume-cd814d75-4b53-4258-b133-1617d3bfecb8 in use for existing mountpoint /var/lib/nova/mnt/f7af0a8752d0f8b2542c2254fb348896 __init__ /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/mount.py:239
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host [req-acf5eab2-20c2-477c-a902-dfa067ab3137 - - - - -] Exception handling connection event: AttributeError: 'NoneType' object has no attribute 'rfind'
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host Traceback (most recent call last):
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 148, in _dispatch_conn_event
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host handler()
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 414, in handler
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host return self._conn_event_handler(*args, **kwargs)
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 477, in _handle_conn_event
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host self._set_host_enabled(enabled, reason)
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3842, in _set_host_enabled
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host mount.get_manager().host_up(self._host)
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/mount.py", line 134, in host_up
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host self.state = _HostMountState(host, self.generation)
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/volume/mount.py", line 229, in __init__
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvirt.host mountpoint = os.path.dirname(disk.source_path)
2020-05-31 22:21:55.148 9 ERROR nova.virt.libvir...


Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Marked invalid for kolla-ansible as it does not seem to involve us; please change it if you find it can/has to be fixed in k-a.

Changed in kolla-ansible:
status: New → Invalid
Lee Yarwood (lyarwood)
Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired