Timeout while waiting for network-vif-plugged event during server rebuild

Bug #1694371 reported by Rabi Mishra
This bug affects 2 people
Affects                    Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)   Invalid       Undecided   Unassigned
OpenStack Heat             Invalid       Undecided   Unassigned
neutron                    Fix Released  High        Kevin Benton

Bug Description

It seems the server goes to ERROR state during rebuild.

traceback:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/opt/stack/new/heat/heat_integrationtests/functional/test_snapshot_restore.py", line 74, in test_stack_snapshot_restore
    self.stack_restore(stack_identifier, snapshot_id)
  File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 626, in stack_restore
    self._wait_for_stack_status(stack_id, wait_for_status)
  File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 357, in _wait_for_stack_status
    fail_regexp):
  File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 321, in _verify_status
    stack_status_reason=stack.stack_status_reason)
heat_integrationtests.common.exceptions.StackBuildErrorException: Stack StackSnapshotRestoreTest-1374582671/7fb8f800-1545-4e34-a6fa-3e2adbf4443a is in RESTORE_FAILED status due to 'Error: resources.my_server: Rebuilding server failed, status 'ERROR''

Noticed at:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/console.html#_2017-05-28_04_34_37_753094

Looks like a nova issue, based on the traceback below.

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-n-cpu.txt.gz?level=ERROR#_May_28_04_14_49_044455

May 28 04:14:49 ubuntu-xenial-osic-cloud1-s3700-9024798 nova-compute[26709]: ERROR nova.compute.manager [instance: 45105d34-b970-4ced-968c-a1c4ead5b282]
  File "/opt/stack/new/nova/nova/compute/manager.py", line 6758, in _error_out_instance_on_exception
    yield
  File "/opt/stack/new/nova/nova/compute/manager.py", line 2814, in rebuild_instance
    bdms, recreate, on_shared_storage, preserve_ephemeral)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 2855, in _do_rebuild_instance_with_claim
    self._do_rebuild_instance(*args, **kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 2977, in _do_rebuild_instance
    self._rebuild_default_impl(**kwargs)
  File "/opt/stack/new/nova/nova/compute/manager.py", line 2714, in _rebuild_default_impl
    block_device_info=new_block_device_info)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2741, in spawn
    destroy_disks_on_failure=True)
  File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5152, in _create_domain_and_network
    raise exception.VirtualInterfaceCreateException()
nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
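
For context, this exception is raised when nova's libvirt driver gives up waiting for neutron to confirm the plug via a network-vif-plugged external event. A minimal, self-contained sketch of that wait pattern (illustrative only; names and structure are simplified from nova's code, and the 300-second default matches the 5-minute gap visible in the logs):

import threading

class VirtualInterfaceCreateException(Exception):
    """Mirrors nova's exception for an unconfirmed vif plug."""

class EventWaiter:
    """Collects expected (name, tag) events and waits for all of them."""

    def __init__(self, expected):
        self._pending = set(expected)
        self._done = threading.Event()
        self._lock = threading.Lock()
        if not self._pending:
            self._done.set()

    def deliver(self, name, tag):
        # Called when an external event arrives from neutron (via the
        # os-server-external-events API on nova-api).
        with self._lock:
            self._pending.discard((name, tag))
            if not self._pending:
                self._done.set()

    def wait(self, timeout):
        if not self._done.wait(timeout):
            raise TimeoutError('still waiting for %s' % self._pending)

def spawn_guest(waiter, plug_vifs, start_guest,
                timeout=300.0, vif_plugging_is_fatal=True):
    # Usage (hypothetical):
    #   waiter = EventWaiter([('network-vif-plugged', port_id)])
    #   spawn_guest(waiter, plug_vifs=..., start_guest=...)
    plug_vifs()    # create tap/OVS ports on the host
    start_guest()  # define and launch the libvirt domain
    try:
        # Block until every expected network-vif-plugged event arrives.
        waiter.wait(timeout)
    except TimeoutError:
        # In this bug the event is never sent during rebuild, so we land
        # here and the instance goes to ERROR.
        if vif_plugging_is_fatal:
            raise VirtualInterfaceCreateException(
                'Virtual Interface creation failed')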

Revision history for this message
Rabi Mishra (rabi) wrote :

I can also see it timing out while waiting for the vif plugging callback in the logs, i.e. "Timeout waiting for vif plugging callback for instance 9456073b-a8d6-44ff-b86a-620fbc68a484". I've only seen these failures for jobs in the osic cloud. Is that just a coincidence, or something to do with the infra?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/469393

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/469393
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c8e1e82bd6216222f31bb959269ce23bdd65d3d5
Submitter: Jenkins
Branch: master

commit c8e1e82bd6216222f31bb959269ce23bdd65d3d5
Author: Thomas Herve <email address hidden>
Date: Wed May 31 11:25:06 2017 +0200

    Skip snapshot integration tests

    Change-Id: Ica5f87614f5fb98f253d65cbd4355dcc7c6714eb
    Related-Bug: #1694371

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: test_stack_snapshot_restore failing intermittently

Are there any errors on the neutron side in the logs where the vif plug fails or the network-vif-plugged event isn't sent back to nova?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually looks like the vif is plugged successfully. This is where it starts:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-n-cpu.txt.gz?level=DEBUG#_May_28_04_09_48_125012

And this is where it says it was successful:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-n-cpu.txt.gz?level=DEBUG#_May_28_04_09_48_221125

And then it times out 5 minutes later:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-n-cpu.txt.gz?level=DEBUG#_May_28_04_14_48_559598

Looks like the port is plugged on the neutron side here:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-q-svc.txt.gz#_May_28_04_09_48_412466

But I don't see a network-vif-plugged event. I do see the network-vif-unplugged event 5 minutes later:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-q-svc.txt.gz#_May_28_04_14_50_528198

There was a network-vif-plugged event before though:

http://logs.openstack.org/16/462216/16/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/17c2da9/logs/screen-q-svc.txt.gz#_May_28_04_09_32_689843

I wonder if ^ is from the original spawn, and the 2nd vif plug is from the rebuild action. That would explain why we don't get a network-vif-plugged event from neutron: it thinks the vif is already plugged for that instance on that host.
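
That hypothesis fits how neutron decides when to notify nova: the event is tied to a port status transition, not to the plug operation itself. A simplified sketch of that trigger logic (illustrative, not neutron's actual code):

def maybe_notify_nova(send_event, port, new_status):
    # Nova is only notified on an actual status transition. If a fast
    # unplug/replug never moves the port off ACTIVE, nothing changes
    # here and no network-vif-plugged event is ever sent.
    if port['status'] == new_status:
        return
    port['status'] = new_status
    if new_status == 'ACTIVE':
        send_event({'name': 'network-vif-plugged',
                    'server_uuid': port['device_id'],
                    'tag': port['id']})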

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/469765

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/469783

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Thomas Herve (<email address hidden>) on branch: master
Review: https://review.openstack.org/469783

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/469765
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f8cb5aebe03dec877b1cd204484ddf83e9f5b44a
Submitter: Jenkins
Branch: master

commit f8cb5aebe03dec877b1cd204484ddf83e9f5b44a
Author: rabi <email address hidden>
Date: Thu Jun 1 12:04:20 2017 +0530

    Skip StackSnapshotRestoreTest

    Change Ica5f87614f5fb98f253d65cbd4355dcc7c6714eb adds it to the
    list of scenario tests to skip, whereas it's a functional test.

    Related-Bug: #1694371

    Change-Id: I84e806bad284cffb32e68fef6b14dfff11dba846

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: test_stack_snapshot_restore failing intermittently

I just noticed the same thing in some novaclient functional tests for rebuild: the tests eventually fail because, during the rebuild, nova times out waiting for the network-vif-plugged event from neutron:

http://logs.openstack.org/82/471082/1/gate/gate-novaclient-dsvm-functional-neutron-ubuntu-xenial/796acb7/logs/screen-n-cpu.txt.gz#_Jun_06_13_19_17_300332

summary: - test_stack_snapshot_restore failing intermittently
+ Timeout while waiting for network-vif-plugged event during server
+ rebuild
Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Changed in heat:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/471459

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/471459
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=82ecd93a20f580a3bbec96bf570f2fcc1dc99f33
Submitter: Jenkins
Branch: master

commit 82ecd93a20f580a3bbec96bf570f2fcc1dc99f33
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 6 15:52:50 2017 -0400

    libvirt: log vm and task state when vif plugging times out

    This enhances the warning we log when we timeout waiting
    for the network-vif-plugged event from Neutron. It includes
    the vm_state and task_state for context on the instance operation
    since this is done for more than just the initial create, it's
    also for things like rebuild.

    The explicit instance uuid is also removed from the message since
    sending the instance kwarg to LOG.warning logs the instance uuid
    automatically. And _LW is also dropped since we no longer translate
    log messages.

    Change-Id: I6daf1569cba2cfcb4e8da0b46c91d5251c9c6740
    Related-Bug: #1694371
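
The resulting call looks roughly like the following (paraphrased from the change; the exact message text in nova may differ):

from oslo_log import log as logging

LOG = logging.getLogger(__name__)

def warn_vif_plug_timeout(instance, events):
    # Passing instance= lets oslo.log prepend the instance uuid
    # automatically, which is why the explicit uuid was dropped
    # from the message body.
    LOG.warning('Timeout waiting for %(events)s for instance with '
                'vm_state %(vm_state)s and task_state %(task_state)s.',
                {'events': events,
                 'vm_state': instance.vm_state,
                 'task_state': instance.task_state},
                instance=instance)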

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/473685

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: New → In Progress
Changed in neutron:
status: In Progress → Confirmed
importance: Undecided → High
Changed in nova:
status: Confirmed → Invalid
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/473685
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0d7952400be3e60973a30fae76a82ec5c8259ffc
Submitter: Jenkins
Branch: master

commit 0d7952400be3e60973a30fae76a82ec5c8259ffc
Author: Kevin Benton <email address hidden>
Date: Mon Jun 12 21:56:09 2017 -0700

    Trigger port status DOWN on VIF replug

    Now with the merge of push notifications, processing a port update
    no longer automatically implies a transition from ACTIVE to BUILD
    to ACTIVE again.

    This resulted in a bug where Nova would unplug and replug an interface
    quickly during rebuild and it would never get a vif-plugged event.
    Nothing in the data model was actually being updated that resulted in
    the status being set to DOWN or BUILD and the port would return before
    the agent would process it as a removed port to mark it as DOWN.

    This fixes the bug by making the agent force the port to DOWN whenever
    it loses its VLAN. Watching for the VLAN loss was already introduced
    to detect these fast unplug/plug events before so this just adds the
    status update.

    Closes-Bug: #1694371
    Change-Id: Ice24eea2534fd6f3b103ec014218a65a45492b1f
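
In outline, the fix makes the agent report a real status transition. A simplified sketch (illustrative, not the exact OVS-agent code):

def handle_vlan_loss(plugin_rpc, context, agent_id, host, ports_lost_vlan):
    # A port losing its local VLAN is the signature of a fast
    # unplug/replug. Forcing it DOWN server-side restores the
    # DOWN -> ACTIVE transition that emits network-vif-plugged
    # toward nova when the replugged port comes back up.
    for port_id in ports_lost_vlan:
        plugin_rpc.update_device_down(context, port_id, agent_id, host)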

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b3

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/559033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/559033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a0599e438aac605230f68d3b6cb16e7fa35a9975
Submitter: Zuul
Branch: stable/ocata

commit a0599e438aac605230f68d3b6cb16e7fa35a9975
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 6 15:52:50 2017 -0400

    libvirt: log vm and task state when vif plugging times out

    This enhances the warning we log when we timeout waiting
    for the network-vif-plugged event from Neutron. It includes
    the vm_state and task_state for context on the instance operation
    since this is done for more than just the initial create, it's
    also for things like rebuild.

    The explicit instance uuid is also removed from the message since
    sending the instance kwarg to LOG.warning logs the instance uuid
    automatically. And _LW is also dropped since we no longer translate
    log messages.

    Conflicts:
            nova/virt/libvirt/driver.py

    NOTE(lyarwood): Conflict is the result of _LW being required in Ocata.

    Change-Id: I6daf1569cba2cfcb4e8da0b46c91d5251c9c6740
    Related-Bug: #1694371
    (cherry picked from commit 82ecd93a20f580a3bbec96bf570f2fcc1dc99f33)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/644431

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/644431
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e38fb9be8dc8c44e051a22515d451f8558be811a
Submitter: Zuul
Branch: stable/ocata

commit e38fb9be8dc8c44e051a22515d451f8558be811a
Author: Kevin Benton <email address hidden>
Date: Mon Jun 12 21:56:09 2017 -0700

    Trigger port status DOWN on VIF replug

    Now with the merge of push notifications, processing a port update
    no longer automatically implies a transition from ACTIVE to BUILD
    to ACTIVE again.

    This resulted in a bug where Nova would unplug and replug an interface
    quickly during rebuild and it would never get a vif-plugged event.
    Nothing in the data model was actually being updated that resulted in
    the status being set to DOWN or BUILD and the port would return before
    the agent would process it as a removed port to mark it as DOWN.

    This fixes the bug by making the agent force the port to DOWN whenever
    it loses its VLAN. Watching for the VLAN loss was already introduced
    to detect these fast unplug/plug events before so this just adds the
    status update.

    Closes-Bug: #1694371
    Change-Id: Ice24eea2534fd6f3b103ec014218a65a45492b1f
    (cherry picked from commit 0d7952400be3e60973a30fae76a82ec5c8259ffc)

Revision history for this message
Derek Harter (derek-u) wrote :

Just posting this here for my reference, and in case anyone else is seeing some instances fail to start reliably in OpenStack. The discussion here finally gave me the hint I needed to figure out a workaround for my issue, and this seems to be the closest bug report/discussion to what I am experiencing.

I am following the basic OpenStack install instructions using Ubuntu 18.04 server editions with OpenStack Stein. I had gotten a good flat provider network working and was able to create instances from images and ssh into them. However, after creating 1 or 2 instances, instance startup began failing randomly, appearing to time out: about 1 time in 4 the instance would start, but the rest of the time startup would fail. I had originally been looking at possible kvm/virtualization issues causing the vm to pause and never resume during creation, but that ultimately didn't appear to be my issue.

In my nova-compute.log on the compute node where an instance was being created and started, I was seeing the following messages:

-------------------
2019-05-20 13:05:52.868 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] Attempting claim on node lion-compute-1-01: memory 2048 MB, disk 5 GB, vcpus 1 CPU
2019-05-20 13:05:52.869 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] Total memory: 12058 MB, used: 4608.00 MB
2019-05-20 13:05:52.869 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] memory limit not specified, defaulting to unlimited
2019-05-20 13:05:52.870 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] Total disk: 65 GB, used: 10.00 GB
2019-05-20 13:05:52.870 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] disk limit not specified, defaulting to unlimited
2019-05-20 13:05:52.871 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] Total vcpu: 16 VCPU, used: 2.00 VCPU
2019-05-20 13:05:52.872 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbdfa72887f32a79c - default default] [instance: 5d14418e-c766-4fe1-b1d1-fdc0bef4d90d] vcpu limit not specified, defaulting to unlimited
2019-05-20 13:05:52.873 32022 INFO nova.compute.claims [req-76c24954-2ce4-485d-84cd-ead03c6b403e 75e2bcf2a0f2497fb08f18b00b38446a ed5341aae98f48bbbd...
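
For anyone debugging similar symptoms, nova has two config options governing this wait. Relaxing them only masks a missing network-vif-plugged event rather than fixing its delivery, and this is not necessarily the workaround the commenter used:

# nova.conf on the compute node (diagnostic workaround only)
[DEFAULT]
# Do not put the instance into ERROR when the vif-plugging wait times out.
vif_plugging_is_fatal = false
# Seconds to wait for network-vif-plugged; default 300 (the 5-minute gap
# seen in the logs above). 0 disables the wait entirely.
vif_plugging_timeout = 300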

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
