Inability to add nova-compute host to os-aggregate

Bug #1512908 reported by Chad Smith
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Landscape Server
Fix Released
High
Andreas Hasenack
Release-31
Fix Released
High
Andreas Hasenack
nova-compute (Juju Charms Collection)
Fix Released
High
Liam Young

Bug Description

From the Landscape Autopilot dashboard, we see a POST request for http://10.1.70.109:8774/v2/aa5fd8bda2344d5992b954e75853746e/os-aggregates/1/action failed with code 404

Nov 3 21:56:43 job-handler-1 INFO Traceback (failure with no frames): <class 'canonical.openstack.api.HTTPError'>: POST request for http://10.1.70.109:8774/v2/aa5fd8bda2344d5992b954e75853746e/os-aggregates/1/action failed with code 404 Not Found: '{"itemNotFound": {"message": "Cannot add host newton in aggregate 1: not found", "code": 404}}'

This indicates that newton did not get registered as a compute node for some reason.

This is a 4-node cloud deployment (swift/iscsi) using ohm, pascal, tesla, and newton.

# Missing hostnames for newton and tesla in nova host-list
csmith@downtown:~$ nova host-list
+----------------------+-----------+-----------+
| host_name | service | zone |
+----------------------+-----------+-----------+
| juju-machine-1-lxc-1 | conductor | internal |
| juju-machine-1-lxc-1 | cert | internal |
| juju-machine-1-lxc-1 | scheduler | internal |
| pascal | compute | region1-1 |
| ohm | compute | region1-1 |
+----------------------+-----------+-----------+

Expected output would contain all compute hosts in nova host-list.
# Successful output from a comparable deployment that contains 4 compute nodes:

csmith@downtown:~$ nova host-list
+----------------------+-----------+-----------+
| host_name | service | zone |
+----------------------+-----------+-----------+
| juju-machine-2-lxc-2 | cert | internal |
| juju-machine-2-lxc-2 | conductor | internal |
| juju-machine-2-lxc-2 | scheduler | internal |
| elkhart | compute | region1-1 |
| sekine | compute | region1-1 |
| darby | compute | region1-1 |
| albany | compute | region1-4 |
+----------------------+-----------+-----------+


Revision history for this message
Chad Smith (chad.smith) wrote :

OSA logs from the failed deployment. Deployment is still up for additional debugging in bstack

Revision history for this message
Chad Smith (chad.smith) wrote :

Looks like this is the same failure mode as lp:1490519 that Mark saw on gmaas.

Revision history for this message
Chad Smith (chad.smith) wrote :

This deployment is liberty.

Revision history for this message
Chad Smith (chad.smith) wrote :

Well, I take back comment #4, the aggregate (AZ) didn't disappear, the error message was just vague.

"Cannot add host newton in aggregate 1: not found" meant that the host newton didn't exist, not the aggregate.

Looking over /var/log/nova/nova-compute.log on newton, we see it failed to initialize a connection to libvirt and as a result didn't register newton as a compute host.

2015-11-03 21:55:46.725 4836 WARNING nova.virt.libvirt.driver [req-3f9aa586-23d0-48da-8465-d307362bf252 - - - - -] Cannot update service status on host "newton" since it is not registered.
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host [req-3f9aa586-23d0-48da-8465-d307362bf252 - - - - -] Connection to libvirt failed: Cannot recv data: Connection reset by peer
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host Traceback (most recent call last):
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/host.py", line 528, in get_connection
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host conn = self._get_connection()
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/host.py", line 515, in _get_connection
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host wrapped_conn = self._get_new_connection()
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/host.py", line 467, in _get_new_connection
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host wrapped_conn = self._connect(self._uri, self._read_only)
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/host.py", line 321, in _connect
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host libvirt.openAuth, uri, auth, flags)
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host rv = execute(f, *args, **kwargs)
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host six.reraise(c, e, tb)
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host rv = meth(*args, **kwargs)
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host File "/usr/lib/python2.7/dist-packages/libvirt.py", line 105, in openAuth
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host if ret is None:raise libvirtError('virConnectOpenAuth() failed')
2015-11-03 21:55:46.726 4836 ERROR nova.virt.libvirt.host libvirtError: Cannot recv data: Connection reset by peer

I'm still having trouble determining what caused the libvirt connection to fail on newton.

description: updated
Chad Smith (chad.smith)
Changed in landscape:
importance: Undecided → Medium
Revision history for this message
Chad Smith (chad.smith) wrote :

The only correlation I can see is a cloud-compute-relation-changed event at 21:55:46 in the juju nova-compute unit log.

2015-11-03 21:55:46 INFO cloud-compute-relation-changed libvirt-bin stop/waiting
2015-11-03 21:55:48 INFO cloud-compute-relation-changed libvirt-bin start/running, process 7553

I'm wondering if the charm shut down libvirt at the same time newton was trying to make libvirt calls, and newton then gave up without retrying.

On other, successfully registered compute nodes, the cloud-compute-relation-changed libvirt stop/start happens at the same time (21:55:46), but registering the compute node and hitting libvirt happens at a different time than the libvirt restart:

nova-compute/2 (success):
2015-11-03 21:55:18.472 15434 WARNING nova.virt.libvirt.driver [req-bc667516-e9b9-4f63-ad59-ad27ed3c1614 - - - - -] Cannot update service status on host "pascal" since it is not registered.
2015-11-03 21:55:18.834 15434 ERROR nova.compute.manager [req-bc667516-e9b9-4f63-ad59-ad27ed3c1614 - - - - -] No compute node record for host pascal
2015-11-03 21:55:18.840 15434 WARNING nova.compute.monitors [req-bc667516-e9b9-4f63-ad59-ad27ed3c1614 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2015-11-03 21:55:18.922 15434 WARNING nova.compute.resource_tracker [req-bc667516-e9b9-4f63-ad59-ad27ed3c1614 - - - - -] No compute node record for pascal:pascal.beretstack

nova-compute/0 (success):
2015-11-03 21:55:23.472 19036 WARNING nova.virt.libvirt.driver [req-b4fb8b11-fabe-4afc-9225-a14a241fd22b - - - - -] Cannot update service status on host "ohm" since it is not registered.
2015-11-03 21:55:23.659 19036 ERROR nova.compute.manager [req-b4fb8b11-fabe-4afc-9225-a14a241fd22b - - - - -] No compute node record for host ohm
2015-11-03 21:55:23.667 19036 WARNING nova.compute.monitors [req-b4fb8b11-fabe-4afc-9225-a14a241fd22b - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2015-11-03 21:55:23.753 19036 WARNING nova.compute.resource_tracker [req-b4fb8b11-fabe-4afc-9225-a14a241fd22b - - - - -] No compute node record for ohm:ohm.beretstack

description: updated
Chad Smith (chad.smith)
summary: - Inability to add missing host to os-aggregate
+ Inability to add nova-compute host to os-aggregate
Revision history for this message
Chad Smith (chad.smith) wrote :

This bug doesn't necessarily look actionable by Landscape: the nova-compute charm restarts libvirt-bin on most config-changed hooks. The nova compute manager runs a number of periodic tasks, each of which creates a new connection to libvirt and attempts to _set_host_enabled(). If the connection fails because the libvirt-bin service is down during a restart, the compute manager raises a traceback and stops working, and the compute node never gets registered.

The fix would be for the nova compute manager's periodic_tasks to be a bit more resilient when the libvirt connection fails.
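Conceptually, that hardening would mean retrying the libvirt connection inside the periodic task instead of letting a single failure abort it. A rough illustration (the helper and its names are hypothetical, not nova's actual internals):

```python
import time


def call_with_retry(fn, retries=3, delay=1.0, exceptions=(Exception,)):
    """Retry a flaky call (e.g. opening a libvirt connection while
    libvirt-bin is mid-restart) instead of letting one failure abort
    the whole periodic task. Hypothetical helper, not nova's code."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            # Give up only after exhausting all attempts.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

With something like this wrapping the connection attempt, a transient "Connection reset by peer" during a libvirt-bin restart would be retried rather than permanently wedging the compute manager.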

One "workaround" in our charms would be to restart the nova-compute services after any libvirt-bin restart, to ensure the manager recovers and stays operational in case a periodic task collided with the restart.
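A minimal sketch of that workaround in charm-hook-style Python (the helper, its signature, and the `run` hook are illustrative, not the nova-compute charm's actual API):

```python
import subprocess


def restart_with_dependents(service, dependents, run=None):
    """Restart a service, then restart anything holding a connection
    to it so stale clients reconnect. Illustrative helper, not the
    actual nova-compute charm code; `run` is injectable for testing."""
    if run is None:
        run = subprocess.check_call
    issued = []
    for svc in [service] + list(dependents):
        # nova-compute caches its libvirt connection; restarting it
        # after libvirt-bin forces a reconnect and re-registration.
        cmd = ['service', svc, 'restart']
        issued.append(cmd)
        run(cmd)
    return issued


# e.g. restart_with_dependents('libvirt-bin', ['nova-compute'])
```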

Chad Smith (chad.smith)
tags: added: kanban-cross-team nova-compute
Chad Smith (chad.smith)
information type: Proprietary → Public
affects: landscape → nova-compute (Juju Charms Collection)
description: updated
tags: removed: kanban-cross-team
Revision history for this message
Chad Smith (chad.smith) wrote :

From /var/log/libvirt/libvirtd.log at the same time as the libvirt-bin service restart:

2015-11-03 21:55:46.700+0000: 4880: error : virCommandWait:2552 : internal error: Child process (/sbin/iptables -w --table nat --insert POSTROUTING --source 192.168.122.0/24 '!' --destination 192.168.122.0/24 --jump MASQUERADE) unexpected fatal signal 15

Chad Smith (chad.smith)
tags: added: landscape
removed: nova-compute
Chad Smith (chad.smith)
description: updated
David Britton (dpb)
Changed in nova-compute (Juju Charms Collection):
importance: Medium → High
Liam Young (gnuoy)
Changed in nova-compute (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
James Page (james-page) wrote :

The compute host records are created by the nova-conductor service as the nova-compute daemons start up. There is code in the charms to ensure that compute daemons are restarted after the nova database etc. has been set up and the conductors are in a state where they should be ready to process RPC calls from compute services.

A few more bits of information may help us identify where the problem is; specifically:

1) logs from /var/log/nova from cloud-controller and nova-compute instances
2) the output of nova service-list - this will show all registered services, including ones that are down (nova host-list shows only services that are up)

Revision history for this message
James Page (james-page) wrote :

Actually this is probably the cause of the restarts of libvirt-bin:

        if config('hugepages'):
            ctxt['hugepages'] = True

        ctxt['host_uuid'] = '%s' % uuid.uuid4()
        return ctxt

Every time libvirtd.conf is rendered, libvirt-bin is restarted because the host_uuid is regenerated.

Revision history for this message
James Page (james-page) wrote :

Also confirming that if libvirt-bin is down when nova-compute starts up, it shuts down straight away.

Hard to reproduce as the nova-compute upstart configuration waits for libvirt-bin to start in its pre-start script.

Changed in nova-compute (Juju Charms Collection):
status: New → In Progress
Changed in nova-compute (Juju Charms Collection):
status: In Progress → Fix Committed
milestone: none → 16.01
tags: added: backport-potential
James Page (james-page)
Changed in nova-compute (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in landscape:
status: New → Fix Committed
importance: Undecided → High
assignee: nobody → Andreas Hasenack (ahasenack)
milestone: none → 15.12
Changed in landscape:
status: Fix Committed → Fix Released