VM evacuation fails with error 'Virtual Interface creation failed'

Bug #1590490 reported by Tatyana Kuterina
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x: Fix Committed / Importance: High / Assigned to: MOS Nova
  9.x: Fix Released / Importance: High / Assigned to: MOS Nova

Bug Description

VM evacuation fails with the error 'Virtual Interface creation failed':
After the nova evacuate <vm_name> command, the VM state changes to Rebuild Spawning and stays there for 5-6 minutes. After this time the status becomes Error.

Error message:
  Message: Virtual Interface creation failed
  Code: 500
  Details:
    File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_function
      return function(self, context, *args, **kwargs)
    File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2809, in rebuild_instance
      bdms, recreate, on_shared_storage, preserve_ephemeral)
    File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2853, in _do_rebuild_instance_with_claim
      self._do_rebuild_instance(*args, **kwargs)
    File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2969, in _do_rebuild_instance
      self._rebuild_default_impl(**kwargs)
    File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2734, in _rebuild_default_impl
      block_device_info=new_block_device_info)
    File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2796, in spawn
      block_device_info=block_device_info)
    File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4959, in _create_domain_and_network
      raise exception.VirtualInterfaceCreateException()
  Created: June 8, 2016, 3:34 p.m.

Steps:
Use an environment with Ceph.
1. Create net01
2. Boot instances vm1 and vm2 in net01 on a single compute node
3. Shut off the compute node where instances are scheduled (with virsh destroy)
4. Evacuate instances vm1 and vm2:
     nova evacuate vm1
     nova evacuate vm2
5. Check that they are rescheduled onto another compute node and are in ACTIVE state (see the verification commands below)
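
One way to perform the check in step 5 (a sketch using the standard nova CLI; vm1 and vm2 are the instance names from the steps above):

     # Expect status ACTIVE and a host different from the failed compute node
     nova show vm1 | egrep 'OS-EXT-SRV-ATTR:host|status'
     nova show vm2 | egrep 'OS-EXT-SRV-ATTR:host|status'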

Diagnostic Snapshot
https://drive.google.com/a/mirantis.com/file/d/0Bz15vbpS5ZPNZ3EzRkRUZFp0Tzg/view?usp=sharing
Screenshot
https://drive.google.com/a/mirantis.com/file/d/0Bz15vbpS5ZPNX2hkOEJqTGQ1WkE/view?usp=sharing

Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check is performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Tatyana, could you please help us understand the importance here? How often is this reproduced?

Do you by any chance still have an environment with a repro? We'll take a look at the logs in the meantime.

Changed in mos:
status: New → Incomplete
assignee: nobody → Tatyana Kuterina (tkuterina)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I quickly checked the logs and see that the VIF plugging failed because nova-compute timed out waiting:

2016-06-08 15:34:17.571 5121 WARNING nova.virt.libvirt.driver [req-9d7bb106-12f5-4bf4-a203-1aa0ce7e64b1 5f5e223276ef4df7ac1c1519c80ff263 bcd5faf949c249868335474c88dd42b3 - - -] [instance: bac44ad7-eaf8-4d9c-b62e-d8537f107193] Timeout waiting for vif plugging callback for instance bac44ad7-eaf8-4d9c-b62e-d8537f107193

At the same time, I see that Neutron successfully delivered the notification to nova-api:

2016-06-08 15:29:22.745 977 DEBUG nova.api.openstack.wsgi [req-163dcfa5-129d-44fc-9187-3ac4d7207814 2c5b7c93a21443ab8a6bc4824cb03443 3c1747ea62c5410faffdb6bb1b29068f - - -] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fd1808e82d0>>, body: {"events": [{"status": "completed", "tag": "27d10f06-73e4-4aa0-98a2-2f4c29a658af", "name": "network-vif-plugged", "server_uuid": "bac44ad7-eaf8-4d9c-b62e-d8537f107193"}]} _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:696
2016-06-08 15:29:22.791 977 INFO nova.api.openstack.compute.server_external_events [req-163dcfa5-129d-44fc-9187-3ac4d7207814 2c5b7c93a21443ab8a6bc4824cb03443 3c1747ea62c5410faffdb6bb1b29068f - - -] Creating event network-vif-plugged:27d10f06-73e4-4aa0-98a2-2f4c29a658af for instance bac44ad7-eaf8-4d9c-b62e-d8537f107193
2016-06-08 15:29:22.794 977 DEBUG oslo_messaging._drivers.amqpdriver [req-163dcfa5-129d-44fc-9187-3ac4d7207814 2c5b7c93a21443ab8a6bc4824cb03443 3c1747ea62c5410faffdb6bb1b29068f - - -] CAST unique_id: 9bd64ed934db48b885d8a2f62ae9cb61 size: 5320 exchange: nova topic: compute _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:496
2016-06-08 15:29:22.801 977 INFO nova.osapi_compute.wsgi.server [req-163dcfa5-129d-44fc-9187-3ac4d7207814 2c5b7c93a21443ab8a6bc4824cb03443 3c1747ea62c5410faffdb6bb1b29068f - - -] 10.109.7.8 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 446 time: 0.0606301

It's not clear why the message was not delivered to the compute side. We'll take a closer look tomorrow.
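
For reference, the 5-6 minute hang from the bug description matches nova-compute's default wait for this event: the rebuild blocks until nova-api relays network-vif-plugged from Neutron or the vif_plugging_timeout option (300 seconds by default) expires. A minimal, self-contained Python sketch of that wait pattern (illustrative only, not Nova's actual wait_for_instance_event implementation):

    import threading

    # Stand-in for the per-instance event that nova-compute waits on
    vif_plugged = threading.Event()

    def wait_for_vif_plugged(event, timeout=300):
        # Nova's real code uses the wait_for_instance_event context
        # manager; the effect is the same: block until the callback
        # arrives or the timeout expires.
        if not event.wait(timeout):
            # Produces the "Timeout waiting for vif plugging callback"
            # warning seen in the log above
            raise RuntimeError('Timeout waiting for vif plugging callback')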

Revision history for this message
Tatyana Kuterina (tkuterina) wrote :

This is reproduced quite often, almost always. I do not have an environment with the problem right now, but one will probably be deployed after 1 p.m. today.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

There are several error messages in the node-4 compute log right before the failure:

2016-06-08 15:34:03.958 5121 ERROR nova.compute.resource_tracker [req-50332790-7a77-474d-8b8b-0db77b3648b6 - - - - -] Migration for instance bac44ad7-eaf8-4d9c-b62e-d8537f107193 refers to another host's instance!

So the message was probably dispatched to another host.

Changed in mos:
assignee: Tatyana Kuterina (tkuterina) → Roman Podoliaka (rpodolyaka)
importance: Undecided → High
status: Incomplete → Confirmed
milestone: none → 9.0-updates
tags: added: area-nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We tried to reproduce this locally and saw that sometimes the event is dispatched to the wrong compute host (last time we successfully evacuated 4 instances out of 4; the other time it failed on 1 of 2 instances). It's not clear why yet, but it looks like a race condition in Nova.

Revision history for this message
Timofey Durakov (tdurakov) wrote :

The root cause of the problem is that Nova can send the network-vif-plugged event to the source compute node: dispatching is based on the instance.host field, which may not have been updated yet.
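
In other words, the faulty dispatch looks roughly like this (a minimal, self-contained sketch; Instance and cast_to_compute are illustrative stand-ins, not Nova's real API):

    class Instance(object):
        def __init__(self, uuid, host):
            self.uuid = uuid
            # During an evacuation this still names the failed source
            # node until the rebuild on the destination updates it
            self.host = host

    def cast_to_compute(host, event_name, instance):
        # Stand-in for the oslo.messaging CAST seen in the logs
        # ("CAST ... exchange: nova topic: compute")
        print('CAST %s for %s -> compute@%s' % (event_name, instance.uuid, host))

    def dispatch_external_event(instance, event_name):
        # Pre-fix behaviour: route solely by instance.host, so the
        # network-vif-plugged event can be consumed on the dead source
        # host while the destination compute times out waiting for it
        cast_to_compute(instance.host, event_name, instance)

    dispatch_external_event(
        Instance('bac44ad7-eaf8-4d9c-b62e-d8537f107193', 'failed-source-node'),
        'network-vif-plugged')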

tags: added: move-to-mu
tags: added: 10.0-reviewed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

An attempt to fix this is here:

https://review.openstack.org/#/c/331707

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Release note:

If Neutron is used and instances are evacuated *right* after a compute host failure, they may go to the ERROR state when booted on a new host. If this happens, use "nova reboot --hard <instance>" to restart them.

tags: added: release-notes
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I think it's OK to work on this in the 9.2 time frame, as the impact is limited: this is only reproduced with Neutron when instances are evacuated *right after* a compute host failure, before neutron-openvswitch-agent recognizes that the port should have gone into the DOWN state; in this case nova-api won't deliver the network-vif-plugged event to the destination compute host. Hard-rebooting the instance should help (see the command below). FWIW, the original compute node has already failed and these VMs are down anyway, so hard-rebooting is an acceptable workaround for now.
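
The workaround as a command (standard nova CLI; vm1 stands for an affected instance):

     # Hard-reboot an evacuated instance that ended up in ERROR state
     nova reboot --hard vm1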

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Artom Lifshitz <email address hidden>
Review: https://review.fuel-infra.org/28377

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/28377
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 79fb821d88790505f6d842108123067331c2b574
Author: Artom Lifshitz <email address hidden>
Date: Wed Nov 9 15:33:50 2016

Send events to all relevant hosts if migrating

Previously, external events were sent to the instance object's host
field. This patch fixes the external event dispatching to check for
migration. If an instance is being migrated, the source and
destination compute are added to the set of hosts to which the event
is sent.

Closes-Bug: #1590490

(cherry picked from commit 5de902a3163c9c079fab22754388bd4e02981298)
Change-Id: If00736ab36df4a5a3be4f02b0a550e4bcae77b1b
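
The approach described in the commit message, as a simplified sketch that extends the illustrative pre-fix example above (stand-in names, not the actual patch):

    class Migration(object):
        def __init__(self, source_compute, dest_compute):
            self.source_compute = source_compute
            self.dest_compute = dest_compute

    def get_relevant_hosts(instance, migration):
        # Post-fix behaviour: if the instance is being migrated or
        # evacuated, include both the source and the destination
        # computes instead of relying on instance.host alone
        hosts = set([instance.host])
        if migration is not None:
            hosts.add(migration.source_compute)
            hosts.add(migration.dest_compute)
        return hosts

    def dispatch_external_event(instance, migration, event_name):
        # The event now reaches the destination host even while
        # instance.host is stale
        for host in get_relevant_hosts(instance, migration):
            cast_to_compute(host, event_name, instance)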

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

^ merged to 9.0/mitaka

tags: added: on-verification
Revision history for this message
Maria Zlatkova (mzlatkova) wrote :

This bug has been described in the release notes for 9.1. Adding the release-notes tag so that the bug is also described in the 9.2 resolved issues.

tags: added: release-notes
removed: release-notes-done
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

The fix is working, but "nova evacuate" has to be used twice to move an instance to another compute node (the first attempt leaves the instance in the "error" status). A bug has been filed: https://bugs.launchpad.net/mos/+bug/1650209
Verified on:
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-nailgun-9.0.0-1.mos8915.noarch
 network-checker-9.0.0-1.mos77.x86_64
 fuel-9.0.0-1.mos6359.noarch
 fuel-misc-9.0.0-1.mos8660.noarch
 fuel-mirror-9.0.0-1.mos158.noarch
 shotgun-9.0.0-1.mos90.noarch
 python-fuelclient-9.0.0-1.mos363.noarch
 fuelmenu-9.0.0-1.mos276.noarch
 fuel-library9.0-9.0.0-1.mos8660.noarch
 fuel-notify-9.0.0-1.mos8660.noarch
 fuel-setup-9.0.0-1.mos6359.noarch
 rubygem-astute-9.0.0-1.mos782.noarch
 fuel-release-9.0.0-1.mos6359.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8915.noarch
 fuel-ostf-9.0.0-1.mos947.noarch
 fuel-bootstrap-cli-9.0.0-1.mos291.noarch
 fuel-ui-9.0.0-1.mos2848.noarch
 fuel-agent-9.0.0-1.mos291.noarch
 python-packetary-9.0.0-1.mos158.noarch
 nailgun-mcagents-9.0.0-1.mos782.noarch
 fuel-migrate-9.0.0-1.mos8660.noarch
 fuel-utils-9.0.0-1.mos8660.noarch
 fuel-openstack-metadata-9.0.0-1.mos8915.noarch

tags: added: release-notes-done
removed: on-verification release-notes
tags: added: release-notes
removed: release-notes-done
Revision history for this message
Maria Zlatkova (mzlatkova) wrote :

A release note has been added to the 9.2 resolved issues: https://review.fuel-infra.org/#/c/29674.

tags: added: release-notes-done
removed: release-notes