After active controller reboot, Build of instance aborted by "Failed to allocate the network(s), not rescheduling"

Bug #1840156 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Joseph Richard
Milestone: (none)

Bug Description

Brief Description
-----------------
Before the active controller reboot, a VM could be booted successfully. After the reboot test completed, VM boot failed with the error "Failed to allocate the network(s), not rescheduling".

Severity
--------
Critical

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name: mtc/test_swact.py::test_swact_controllers

Expected Behavior
------------------
VMs boot up successfully

Actual Behavior
----------------
VM boot failed

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-13_20-59-00

Last Pass
---------
2019-08-12_20-59-00

Timestamp/Logs
--------------
[2019-08-14 09:12:38,453] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --flavor=028ab635-1019-4bfa-9742-8a9d08ce7451 --poll --boot-volume=e2109162-f063-437a-812c-51dd356e4929 --nic net-id=6304dc81-adc1-41d5-a57d-2fa05b34e0be --nic net-id=f51ecea2-88e8-48b5-ac54-07a4f01aea4b tenant1-vm-1'

Server building... 100% complete
Finished

[2019-08-14 09:18:46,417] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1

[2019-08-14 09:26:52,225] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:26:53,608] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| c01a937d-0eaa-4fec-8ab6-1c1107a102eb | 100.114 | NTP address 2602:fde5:2a::1 is not a valid or a reachable NTP server. | host=controller-1.ntp=2602:fde5:2a::1 | minor | 2019-08-14T08:51:32.711473 |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
controller-0:~$

[2019-08-14 09:27:14,243] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --image=b81e1a57-cb9c-4f49-b1ad-2a69c5b8e2de --flavor=028ab635-1019-4bfa-9742-8a9d08ce7451 --poll --nic net-id=6304dc81-adc1-41d5-a57d-2fa05b34e0be --nic net-id=f51ecea2-88e8-48b5-ac54-07a4f01aea4b tenant1-swact_img-2'
[2019-08-14 09:32:55,407] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------------------------------------------+
| Property | Value |
+--------------------------------------+---------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |

Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):
controller-0:~$

[2019-08-14 09:32:55,512] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show ee1bbe3d-7922-439a-ab65-7dfb2810ecb9'

| fault | {u'message': u'Build of instance ee1bbe3d-7922-439a-ab65-7dfb2810ecb9 aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-08-14T09:32:53Z'} |

[2019-08-14 09:33:20,187] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:33:21,962] 423 DEBUG MainThread ssh.expect :: Output:

controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :

Confirmed with Joseph that this is a different issue from 1835807.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Joseph Richard (josephrichard) wrote :

While this is a different issue than 1835807, it is closely related.

In 1835807, the issue is rabbit ignoring this binding:
neutron exchange q-l3-plugin queue q-l3-plugin []

Here, rabbit is ignoring this binding instead:
neutron exchange q-plugin queue q-plugin []

Depending on the fix for 1835807, it may also cover this.
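
For reference, a minimal way to check whether that binding is present on a running system (a sketch only: the "openstack" namespace, the rabbitmq pod name placeholder, and the "neutron" vhost are assumptions about a typical stx-openstack deployment, not taken from this report):

# Sketch: confirm whether the q-plugin exchange -> queue binding exists.
# <rabbitmq-pod> is a placeholder; namespace "openstack" and vhost "neutron" are assumptions.
kubectl -n openstack get pods | grep rabbitmq
kubectl -n openstack exec <rabbitmq-pod> -- \
    rabbitmqctl list_bindings -p neutron source_name source_kind destination_name destination_kind \
    | grep q-plugin
# When healthy this shows the "q-plugin" exchange bound to the "q-plugin" queue;
# if the binding is missing, RPC messages published to that exchange are never delivered.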

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 / high priority given this issue has been seen multiple times in sanity/regression runs. The system does not recover without manual intervention.
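
The manual intervention is not described in this report. One recovery path commonly used for this class of missing-binding RabbitMQ problems (an assumption, not confirmed in this report) is to restart the affected pods so that oslo.messaging re-declares its queues and bindings:

# Assumption: restarting the pods re-declares the queues/bindings.
# Pod names are placeholders; the "openstack" namespace is assumed.
kubectl -n openstack delete pod <neutron-server-pod> <rabbitmq-pod>
# The controllers recreate the pods; verify with:
kubectl -n openstack get pods | grep -E 'neutron|rabbitmq'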

tags: added: stx.2.0 stx.networking
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Richard (josephrichard)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Per discussion with Joseph, this issue was encountered on a hardware lab during the storage regression testcase
storage/test_filesystem.py::test_controllerfs_mod_when_host_locked. While the testcase itself is reported as passed, the neutron-ovs-agent log begins reporting failures to report state:

{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID af2b0df78e5b441db4a151c982906da4\n","stream":"stdout","time":"2019-08-18T15:47:13.270851422Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):\n","stream":"stdout","time":"2019-08-18T15:47:13.270901208Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py\", line 346, in _report_state\n","stream":"stdout","time":"2019-08-18T15:47:13.270909837Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)\n","stream":"stdout","time":"2019-08-18T15:47:13.27092587Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/neutron/agent/rpc.py\", line 101, in report_state\n","stream":"stdout","time":"2019-08-18T15:47:13.270932796Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)\n","stream":"stdout","time":"2019-08-18T15:47:13.27093909Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 178, in call\n","stream":"stdout","time":"2019-08-18T15:47:13.270954099Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270960567Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 128, in _send\n","stream":"stdout","time":"2019-08-18T15:47:13.270966461Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270972536Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 645, in send\n","stream":"stdout","time":"2019-08-18T15:47:13.270978171Z"}
{"log":"2019-08-18 15:47:13.263 98 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry)\n","stream":"stdout","time":"2019-08-18T15:47:13.270984472Z"}
{"log":"2019-08-18 15:47:1...

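To check whether a node has entered this state, the same signature can be searched for in the containerized agent's log (a sketch; the pod name and the host log path pattern are assumptions):

# Sketch: look for the report_state MessagingTimeout signature.
# <neutron-ovs-agent-pod> is a placeholder; namespace "openstack" is assumed.
kubectl -n openstack logs <neutron-ovs-agent-pod> | grep "Failed reporting state"
# or directly in the docker json logs on the host (path pattern is an assumption):
sudo grep "Failed reporting state" /var/log/containers/*neutron-ovs-agent*.log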

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Affecting the nova regular regression suite as well, on the following HW lab: instances failed to launch with the fault shown below.
Lab: IP_33_36
Load: 2019-08-16_20-59-00
Job: StarlingX_2.0_build
Build Server: yow-cgts4-lx.wrs.com
Node Config: 2+2
Software Version: 19.01

e.g.
FAIL test_live_migrate_vm_positive[local_image-0-0-None-1-volume-False]
fault | {u'message': u'Build of instance 83dbd0c8-2c7c-4baa-a772-3d6b36d6c76a aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-08-18T03:48:01Z'} |

FAIL test_live_migrate_vm_positive[local_image-0-0-dedicated-2-volume-False]
FAIL test_live_migrate_vm_positive[local_image-1-0-dedicated-2-volume-False]
FAIL test_live_migrate_vm_positive[local_image-0-512-shared-1-volume-False]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-2-volume-True]
FAIL test_live_migrate_vm_positive[local_image-0-0-shared-2-image-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-1-image-False]
FAIL test_live_migrate_vm_positive[local_image-0-0-None-2-image_with_vol-False]
FAIL test_live_migrate_vm_positive[local_image-0-0-dedicated-1-image_with_vol-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-2-image_with_vol-True]
FAIL test_live_migrate_vm_positive[local_image-1-512-dedicated-1-image_with_vol-False]
FAIL test_live_migrate_vm_negative[local_image-0-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/678848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Joseph Richard (<email address hidden>) on branch: master
Review: https://review.opendev.org/678848
Reason: Just hit this error during my pre-submission testing. Given that disabling the ha policy hasn't entirely prevented this, I am abandoning this change.
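
For context, "the ha policy" here refers to RabbitMQ's queue-mirroring (HA) policy. Roughly, such a policy is inspected, cleared, and re-applied as below, run inside the rabbitmq container (a sketch; the "neutron" vhost and the policy name "ha" are assumptions, not taken from the abandoned change):

# Sketch: RabbitMQ queue-mirroring ("HA") policy handling. Vhost and policy name are assumptions.
rabbitmqctl list_policies -p neutron      # show which policies are defined on the vhost
rabbitmqctl clear_policy -p neutron ha    # disabling mirroring would mean clearing such a policy
rabbitmqctl set_policy -p neutron ha '^(?!amq\.).*' '{"ha-mode":"all"}'   # re-apply mirroring to all non-amq queues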

Revision history for this message
Joseph Richard (josephrichard) wrote :

Marking as duplicate of https://bugs.launchpad.net/starlingx/+bug/1835807 as this is the same issue in rabbit affecting a different queue. This bug will be tracked from that issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix to match the duplicate LP

Changed in starlingx:
status: Triaged → Won't Fix
Peng Peng (ppeng)
tags: removed: stx.retestneeded