After active controller reboot, Build of instance aborted by "Failed to allocate the network(s), not rescheduling"

Bug #1840156 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Joseph Richard
Milestone: (none)

Bug Description

Brief Description
-----------------
Before the active controller reboot, a VM could be booted successfully. After the reboot test completed, VM boot failed with the error "Failed to allocate the network(s), not rescheduling".

Severity
--------
Critical

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name: mtc/test_swact.py::test_swact_controllers

Expected Behavior
------------------
VMs boot up successfully

Actual Behavior
----------------
VM boot failed

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-13_20-59-00

Last Pass
---------
2019-08-12_20-59-00

Timestamp/Logs
--------------
[2019-08-14 09:12:38,453] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --flavor=028ab635-1019-4bfa-9742-8a9d08ce7451 --poll --boot-volume=e2109162-f063-437a-812c-51dd356e4929 --nic net-id=6304dc81-adc1-41d5-a57d-2fa05b34e0be --nic net-id=f51ecea2-88e8-48b5-ac54-07a4f01aea4b tenant1-vm-1'

Server building... 100% complete
Finished

[2019-08-14 09:18:46,417] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1

[2019-08-14 09:26:52,225] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:26:53,608] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| c01a937d-0eaa-4fec-8ab6-1c1107a102eb | 100.114 | NTP address 2602:fde5:2a::1 is not a valid or a reachable NTP server. | host=controller-1.ntp=2602:fde5:2a::1 | minor | 2019-08-14T08:51:32.711473 |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
controller-0:~$

[2019-08-14 09:27:14,243] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --image=b81e1a57-cb9c-4f49-b1ad-2a69c5b8e2de --flavor=028ab635-1019-4bfa-9742-8a9d08ce7451 --poll --nic net-id=6304dc81-adc1-41d5-a57d-2fa05b34e0be --nic net-id=f51ecea2-88e8-48b5-ac54-07a4f01aea4b tenant1-swact_img-2'
[2019-08-14 09:32:55,407] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------------------------------------------+
| Property | Value |
+--------------------------------------+---------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |

Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):
controller-0:~$

[2019-08-14 09:32:55,512] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show ee1bbe3d-7922-439a-ab65-7dfb2810ecb9'

| fault | {u'message': u'Build of instance ee1bbe3d-7922-439a-ab65-7dfb2810ecb9 aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-08-14T09:32:53Z'} |

[2019-08-14 09:33:20,187] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:33:21,962] 423 DEBUG MainThread ssh.expect :: Output:

controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :

Confirmed with Joseph that this is a different issue from 1835807.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Joseph Richard (josephrichard) wrote :

While this is a different issue than 1835807, it is closely related.

In 1835807, the issue is rabbit ignoring this binding:
neutron exchange q-l3-plugin queue q-l3-plugin []

Here, rabbit is ignoring this binding instead:
neutron exchange q-plugin queue q-plugin []

Depending on the fix for 1835807, it may also cover this.
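
For reference, a minimal way to check whether that binding is present on a running system (a sketch only: the "openstack" namespace, the rabbitmq pod name placeholder, and the "neutron" vhost are assumptions about a typical stx-openstack deployment, not taken from this report):

# Sketch: confirm whether the q-plugin exchange -> queue binding exists.
# <rabbitmq-pod> is a placeholder; namespace "openstack" and vhost "neutron" are assumptions.
kubectl -n openstack get pods | grep rabbitmq
kubectl -n openstack exec <rabbitmq-pod> -- \
    rabbitmqctl list_bindings -p neutron source_name source_kind destination_name destination_kind \
    | grep q-plugin
# When healthy this shows the "q-plugin" exchange bound to the "q-plugin" queue;
# if the binding is missing, RPC messages published to that exchange are never delivered.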

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 / high priority given this issue has been seen multiple times in sanity/regression runs. The system does not recover without manual intervention.
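
The manual intervention is not described in this report. One recovery path commonly used for this class of missing-binding RabbitMQ problems (an assumption, not confirmed in this report) is to restart the affected pods so that oslo.messaging re-declares its queues and bindings:

# Assumption: restarting the pods re-declares the queues/bindings.
# Pod names are placeholders; the "openstack" namespace is assumed.
kubectl -n openstack delete pod <neutron-server-pod> <rabbitmq-pod>
# The controllers recreate the pods; verify with:
kubectl -n openstack get pods | grep -E 'neutron|rabbitmq'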

tags: added: stx.2.0 stx.networking
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Richard (josephrichard)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Per discussion with Joseph, this issue was encountered on a hardware lab during the storage regression testcase
storage/test_filesystem.py::test_controllerfs_mod_when_host_locked. While the testcase itself is reported as passed, the neutron-ovs-agent log begins reporting failures to report state:

{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID af2b0df78e5b441db4a151c982906da4\n","stream":"stdout","time":"2019-08-18T15:47:13.270851422Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):\n","stream":"stdout","time":"2019-08-18T15:47:13.270901208Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py\", line 346, in _report_state\n","stream":"stdout","time":"2019-08-18T15:47:13.270909837Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)\n","stream":"stdout","time":"2019-08-18T15:47:13.27092587Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/neutron/agent/rpc.py\", line 101, in report_state\n","stream":"stdout","time":"2019-08-18T15:47:13.270932796Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)\n","stream":"stdout","time":"2019-08-18T15:47:13.27093909Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 178, in call\n","stream":"stdout","time":"2019-08-18T15:47:13.270954099Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270960567Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 128, in _send\n","stream":"stdout","time":"2019-08-18T15:47:13.270966461Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270972536Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 645, in send\n","stream":"stdout","time":"2019-08-18T15:47:13.270978171Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270984472Z"}
{"log":"2019-08-18 15:47:1...

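To check whether a node has entered this state, the same signature can be searched for in the containerized agent's log (a sketch; the pod name and the host log path pattern are assumptions):

# Sketch: look for the report_state MessagingTimeout signature.
# <neutron-ovs-agent-pod> is a placeholder; namespace "openstack" is assumed.
kubectl -n openstack logs <neutron-ovs-agent-pod> | grep "Failed reporting state"
# or directly in the docker json logs on the host (path pattern is an assumption):
sudo grep "Failed reporting state" /var/log/containers/*neutron-ovs-agent*.log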

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Affecting the nova regular regression suite as well, on the following HW lab: instances failed to launch with the fault shown below.
Lab: IP_33_36
Load: 2019-08-16_20-59-00
Job: StarlingX_2.0_build
Build Server: yow-cgts4-lx.wrs.com
Node Config: 2+2
Software Version: 19.01

e.g.
FAIL test_live_migrate_vm_positive[local_image-0-0-None-1-volume-False]
fault | {u'message': u'Build of instance 83dbd0c8-2c7c-4baa-a772-3d6b36d6c76a aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-08-18T03:48:01Z'} |

FAIL test_live_migrate_vm_positive[local_image-0-0-dedicated-2-volume-False]
FAIL test_live_migrate_vm_positive[local_image-1-0-dedicated-2-volume-False]
FAIL test_live_migrate_vm_positive[local_image-0-512-shared-1-volume-False]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-2-volume-True]
FAIL test_live_migrate_vm_positive[local_image-0-0-shared-2-image-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-1-image-False]
FAIL test_live_migrate_vm_positive[local_image-0-0-None-2-image_with_vol-False]
FAIL test_live_migrate_vm_positive[local_image-0-0-dedicated-1-image_with_vol-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-2-image_with_vol-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-1-image_with_vol-False]
FAIL test_live_migrate_vm_negative[local_image-0-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/678848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Joseph Richard (<email address hidden>) on branch: master
Review: https://review.opendev.org/678848
Reason: Just hit this error during my pre-submission testing. Given that disabling the ha policy hasn't entirely prevented this, I am abandoning this change.
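
For context, "the ha policy" here refers to RabbitMQ's queue-mirroring (HA) policy. Roughly, such a policy is inspected, cleared, and re-applied as below, run inside the rabbitmq container (a sketch; the "neutron" vhost and the policy name "ha" are assumptions, not taken from the abandoned change):

# Sketch: RabbitMQ queue-mirroring ("HA") policy handling. Vhost and policy name are assumptions.
rabbitmqctl list_policies -p neutron      # show which policies are defined on the vhost
rabbitmqctl clear_policy -p neutron ha    # disabling mirroring would mean clearing such a policy
rabbitmqctl set_policy -p neutron ha '^(?!amq\.).*' '{"ha-mode":"all"}'   # re-apply mirroring to all non-amq queues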

Revision history for this message
Joseph Richard (josephrichard) wrote :

Marking as duplicate of https://bugs.launchpad.net/starlingx/+bug/1835807 as this is the same issue in rabbit affecting a different queue. This bug will be tracked from that issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix to match the duplicate LP

Changed in starlingx:
status: Triaged → Won't Fix
Peng Peng (ppeng)
tags: removed: stx.retestneeded