nova refuses to start if there are baremetal instances with no associated node

Bug #1272623 reported by Robert Collins
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  High        Unassigned   -
tripleo                   Fix Released  High        Unassigned   -

Bug Description

This can happen if a deployment is interrupted at just the wrong time.

2014-01-25 06:53:38,781.781 14556 DEBUG nova.compute.manager [req-e1958f79-b0c0-4c80-b284-85bb56f1541d None None] [instance: e21e6bca-b528-4922-9f59-7a1a6534ec8d] Current state is 1, state in DB is 1. _init_instance /opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py:720
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 346, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 56, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
    result = function(*args, **kwargs)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/openstack/common/service.py", line 480, in run_service
    service.start()
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/service.py", line 172, in start
    self.manager.init_host()
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 805, in init_host
    self._init_instance(context, instance)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 684, in _init_instance
    self.driver.plug_vifs(instance, net_info)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/driver.py", line 538, in plug_vifs
    self._plug_vifs(instance, network_info)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/driver.py", line 543, in _plug_vifs
    node = _get_baremetal_node_by_instance_uuid(instance['uuid'])
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/driver.py", line 85, in _get_baremetal_node_by_instance_uuid
    node = db.bm_node_get_by_instance_uuid(ctx, instance_uuid)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/db/api.py", line 101, in bm_node_get_by_instance_uuid
    instance_uuid)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 112, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/virt/baremetal/db/sqlalchemy/api.py", line 152, in bm_node_get_by_instance_uuid
    raise exception.InstanceNotFound(instance_id=instance_uuid)
InstanceNotFound: Instance 84c6090b-bf42-4c6a-b2ff-afb22b5ff156 could not be found.

If there is no allocated node, we can just skip that part of delete.
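
As an illustration only (not the actual patch), "skip that part" could look roughly like the sketch below inside nova/virt/baremetal/driver.py: catch the InstanceNotFound seen in the traceback around the node lookup and treat a missing association as nothing to tear down. The wrapper name _node_teardown_if_associated is hypothetical; _get_baremetal_node_by_instance_uuid, exception.InstanceNotFound and instance['uuid'] are taken from the traceback above.

    # Sketch, not nova's actual code.  Assumes the module context of
    # nova/virt/baremetal/driver.py, where LOG, exception and
    # _get_baremetal_node_by_instance_uuid already exist.
    def _node_teardown_if_associated(instance):
        try:
            node = _get_baremetal_node_by_instance_uuid(instance['uuid'])
        except exception.InstanceNotFound:
            # No bm_nodes row ever pointed at this instance, so there is
            # no node-side state to clean up; skip instead of blowing up.
            LOG.warning('Instance %s has no associated baremetal node, '
                        'skipping node teardown', instance['uuid'])
            return None
        # ... proceed with the normal node teardown using `node` ...
        return node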

Tags: baremetal
summary: - nova refuses to delete baremetal instances if there is no associated
- node
+ nova refuses to start if there are baremetal instances with no
+ associated node
description: updated
Revision history for this message
Robert Collins (lifeless) wrote :

We're running a monkeypatch to avoid this at the moment.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Robert Collins (lifeless) wrote :

Questions from jog:
 - How does this happen?
 - Is this only nova-bm?

Revision history for this message
Robert Collins (lifeless) wrote :

On startup nova-compute attempts to restore the state of the node to its internal model, e.g. start VMs that are meant to be running, fully delete VMs that are meant to be purged from disk, etc.

We also try to start VMs in state 'ERROR' here, which AFAICT doesn't happen in any other circumstance. This is conceptually problematic because ERROR is used to indicate that nova has given up on the VM, rather than it being in the middle of an operation which needs resuming.

One particular thing that can happen is that once a VM is in state ERROR, there is no guarantee that the axioms for it are maintained - it might not have had networking allocated, for instance.

The thing that caused this particular backtrace was an instance of that: nova-compute errored the VM before writing the instance id to the bm_nodes table (which is what captures the association of instance to node). This happened quite legitimately: the scheduler was trying to schedule to an already-used node (due to a different issue, but the scheduler is intrinsically racy, so this should be expected in general). Then, when restarted, nova-compute attempted to restart the ERROR state VM and threw an exception (rightly so; attempting to power on nothing is an error).
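
To make the failure mode concrete, here is a toy, standalone model (plain Python, not nova code) of the lookup in the traceback: bm_node_get_by_instance_uuid finds the bm_nodes row whose instance_uuid matches, and when the association was never written it raises InstanceNotFound, which kills init_host and therefore the whole compute service.

    # Toy stand-in for nova.virt.baremetal.db (the real code queries the
    # bm_nodes table via SQLAlchemy).  UUIDs are the ones from the log above.
    class InstanceNotFound(Exception):
        pass

    bm_nodes = [
        # Healthy association written by a successful deploy.
        {'id': 1, 'instance_uuid': 'e21e6bca-b528-4922-9f59-7a1a6534ec8d'},
        # Node the ERROR'd instance raced for: the association was never
        # written, so instance_uuid stayed NULL.
        {'id': 2, 'instance_uuid': None},
    ]

    def bm_node_get_by_instance_uuid(instance_uuid):
        for node in bm_nodes:
            if node['instance_uuid'] == instance_uuid:
                return node
        raise InstanceNotFound(instance_uuid)

    # What init_host effectively asks for while "restoring" the ERROR'd VM:
    bm_node_get_by_instance_uuid('84c6090b-bf42-4c6a-b2ff-afb22b5ff156')
    # -> InstanceNotFound, and nova-compute refuses to start.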

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/69108
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6534a89de9cabc274cbdb7d2ecee3d851c456a87
Submitter: Jenkins
Branch: master

commit 6534a89de9cabc274cbdb7d2ecee3d851c456a87
Author: Steve Kowalik <email address hidden>
Date: Sat Jan 25 20:00:19 2014 +1300

    Don't try to restore VM's in state ERROR.

    We don't try to restore VM's that are in a failed BUILDING state, so
    attempting to start ERROR VMs is more than a little weird. The one
    exception to this rule are VMs that are in RESIZE_MIGRATING, since
    recovery is already attempted. It's also a problem, because many ERROR
    states aren't recoverable from (at the moment anyhow).

    Closes-Bug: #1272623
    Change-Id: I0599b83a82ad3ee67a92126d3b57df5b02e20539
    Co-Authored-By: Robert Collins <email address hidden>
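
In essence, the change guards _init_instance so that instances in vm_state ERROR are left alone on startup unless their task_state is RESIZE_MIGRATING. A simplified sketch of that guard (not the verbatim diff, and the exact condition in the merged change may differ; see the review linked above):

    # Simplified sketch of the guard added to ComputeManager._init_instance
    # in nova/compute/manager.py; vm_states and task_states are nova's real
    # modules, but the surrounding code is elided.
    from nova.compute import task_states
    from nova.compute import vm_states

    def _init_instance(self, context, instance):
        if (instance['vm_state'] == vm_states.ERROR and
                instance['task_state'] != task_states.RESIZE_MIGRATING):
            # Nova has already given up on this VM; trying to "restore" it
            # here is what tripped the InstanceNotFound in this bug.
            return
        # ... existing restore logic (plug VIFs, resume power state, etc.) ...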

Changed in nova:
status: Triaged → Fix Committed
Changed in nova:
milestone: none → icehouse-3
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/81690

Revision history for this message
wangpan (hzwangpan) wrote :

I believe this bug also affects the libvirt driver (qemu hypervisor) in havana, so I cherry-picked it to havana.
Please see the trace below: nova-compute tried to restore an error instance, and it failed to start in the end.

2014-03-19 23:02:57.783 24757 DEBUG nova.virt.libvirt.vif [req-d5ae9690-179e-4661-928f-9c3febaeda5f None None] vif_type=binding_failed instance=<nova.objects.instance.Instance object at 0x3f38750> vif=VIF({'ovs_interfaceid': None, 'network': Network({'bridge': None, 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': u'fixed', 'floating_ips': [], 'address': u'10.0.17.4'})], 'version': 4, 'meta': {u'dhcp_server': u'10.0.17.3'}, 'dns': [], 'routes': [], 'cidr': u'10.0.17.0/24', 'gateway': IP({'meta': {}, 'version': 4, 'type': u'gateway', 'address': u'10.0.17.1'})})], 'meta': {u'injected': False, u'tenant_id': u'e1caab985d1e4418a8b0e4d869afdd25'}, 'id': u'001361ce-ebbf-44de-aa7a-2943682bfa3a', 'label': u'admin-test'}), 'devname': u'tapa13b662a-24', 'qbh_params': None, 'meta': {}, 'address': u'fa:16:3e:15:ac:a5', 'type': u'binding_failed', 'id': u'a13b662a-24d3-4ccd-8788-f59151376e6f', 'qbg_params': None}) plug /usr/lib/python2.7/dist-packages/nova/virt/libvirt/vif.py:544
2014-03-19 23:02:57.786 24757 ERROR nova.openstack.common.threadgroup [-] Unexpected vif_type=binding_failed
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 117, in wait
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup x.wait()
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 49, in wait
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup result = function(*args, **kwargs)
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 65, in run_service
2014-03-19 23:02:57.786 24757 TRACE nova.openstack.common.threadgroup service....


Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → 2014.1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/havana)

Change abandoned by Wangpan (<email address hidden>) on branch: stable/havana
Review: https://review.openstack.org/81690

Revision history for this message
Ben Nemec (bnemec) wrote :

It appears this has been fixed in Nova for a long time.

Changed in tripleo:
status: Triaged → Fix Released