After active controller reboot, VM boot up failed with error "Failed to allocate the network(s), not rescheduling"

Bug #1836928 reported by Peng Peng
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Joseph Richard
Milestone:

Bug Description

Brief Description
-----------------
The stx-openstack application applied successfully, but subsequent VM boot-ups failed with the error:
"Failed to allocate the network(s), not rescheduling"

It seems different from https://bugs.launchpad.net/starlingx/+bug/1833622
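
A quick way to retrieve the full fault for a failed boot is to show only the fault column of the server (a minimal sketch using standard openstackclient output options; <instance-uuid> is a placeholder for the failed instance):

openstack server show <instance-uuid> -f value -c fault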

Severity
--------
Critical

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx master as of 20190717T013000Z

Last Pass
---------
20190712T013000Z

Timestamp/Logs
--------------
[2019-07-17 09:32:46,241] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-17 09:32:47,775] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+

[2019-07-17 09:33:50,719] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=0ba26c17-f20b-4653-b98c-3cccedbd758e --key-name=keypair-tenant1 --poll --flavor=0e193dd4-87ba-4cb9-a6ac-d2dd6cf55aef --nic net-id=f6712dfb-9cba-4d3b-92f4-6cd43c64cd8d --nic net-id=02cf1817-9255-4eb4-a3d9-8559f6a395c1 tenant1-vm-1'
[2019-07-17 09:39:17,028] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | hztD5j7XeXbF |
| config_drive | |
| created | 2019-07-17T09:33:54Z |
| description | - |
| flavor:disk | 2 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"hw:mem_page_size": "2048"} |
| flavor:original_name | flavor-default-size2 |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| id | 1f4bafed-2ef4-4044-9486-5223de2f3cdb |
| image | Attempt to boot from volume - no image supplied |
| key_name | keypair-tenant1 |
| locked | False |
| metadata | {} |
| name | tenant1-vm-1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| server_groups | [] |
| status | BUILD |
| tags | [] |
| tenant_id | d8f07a22e9c3411782c491a5f1bcf9d8 |
| trusted_image_certificates | - |
| updated | 2019-07-17T09:33:54Z |
| user_id | 4c13574c55d944b3bd3258169ebe4c06 |
+--------------------------------------+-------------------------------------------------+

Server building... 0% complete
Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):

[2019-07-17 09:39:17,133] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show 1f4bafed-2ef4-4044-9486-5223de2f3cdb'

| created | 2019-07-17T09:33:54Z |
| fault | {u'message': u'Build of instance 1f4bafed-2ef4-4044-9486-5223de2f3cdb aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-07-17T09:39:14Z'} |
| flavor | flavor-default-size2 (0e193dd4-87ba-4cb9-a6ac-d2dd6cf55aef)

[2019-07-17 09:45:56,018] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-17 09:45:57,479] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
description: updated
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, can you please provide more info on why you believe this bug is different from https://bugs.launchpad.net/starlingx/+bug/1833622 ? Thanks

description: updated
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

1833622 was marked as a duplicate of 1833609 - "stx-openstack application stuck at processing chart: osh-openstack-ceph-rgw after unlock standby controller".

But when this bug (1836928) happened, the stx-openstack application status was applied.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, please add a new set of logs from the new occurrence.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_63_66
Load: 20190718T013000Z

From 09:29 to 11:29 the stx-openstack application status was applied,
but all VM boot-up attempts during this period failed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@ppeng, were there any node reboots (controller or compute) triggered by the sanity suite prior to the VM boot-up failures? Were there any successful VM launches in this sanity run?

tags: added: stx.sanity
Revision history for this message
Peng Peng (ppeng) wrote :

Yes, there was an active controller reboot before this issue happened, and there was a successful VM boot-up prior to the reboot. The detailed timestamps are as follows:

09:30 VM boot-up succeeded
09:37 active controller-1 rebooted
09:45 controller-1 fully up
09:46 VM boot-up failed as described
      (all subsequent VM boot-up attempts also failed)
11:29 last VM boot-up failed

The test suite checked application-list throughout this period; the stx-openstack application status consistently showed applied/completed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

If this issue started happening after a node reboot, then you are hitting:
https://bugs.launchpad.net/starlingx/+bug/1835807

Marking as a duplicate.

Changed in starlingx:
importance: Undecided → High
status: Incomplete → Triaged
tags: added: stx.2.0 stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
summary: - VM boot up failed, fault msg "Failed to allocate the network(s), not
- rescheduling"
+ After active controller reboot, VM boot up failed with error "Failed to
+ allocate the network(s), not rescheduling"
Revision history for this message
Yang Liu (yliu12) wrote :

In the latest sanity run, it does not look like this is caused by the force reboot of the controller.
Here are the timestamps:
# Force rebooted active controller controller-1
[2019-07-24 09:35:04,259] 180 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1

# Launched two vms and swact active controller
[2019-07-24 09:44:38,117] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

# Existing vms are reachable and all network agents are up after above operations
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
| 21085941-338c-47f1-b438-8ce2c5d080a9 | NIC Switch agent | compute-1 | None | :-) | UP | neutron-sriov-nic-agent |
| 4c32d6a6-f35b-471d-a505-0a6324184d3b | Metadata agent | compute-1 | None | :-) | UP | neutron-metadata-agent |
| 5129133a-422a-40e4-b095-d8407ae03a49 | Metadata agent | compute-0 | None | :-) | UP | neutron-metadata-agent |
| 6c171c45-186c-46fa-80e0-dcbe4f91d624 | NIC Switch agent | compute-0 | None | :-) | UP | neutron-sriov-nic-agent |
| 876e46cf-3e36-4782-9537-72def87df324 | L3 agent | compute-0 | nova | :-) | UP | neutron-l3-agent |
| 9250f08b-51f8-4775-abbc-f6dbd6cbdf4a | DHCP agent | compute-1 | nova | :-) | UP | neutron-dhcp-agent |
| 94d53080-ef97-4405-a4b9-a5bede967423 | Open vSwitch agent | compute-1 | None | :-) | UP | neutron-openvswitch-agent |
| a4f454c2-ab64-405c-827e-fa719f39794e | DHCP agent | compute-0 | nova | :-) | UP | neutron-dhcp-agent |
| d2cc032e-3221-4e90-9179-4dde96ee7618 | L3 agent | compute-1 | nova | :-) | UP | neutron-l3-agent |
| f661177b-a1ef-4cb0-aa88-ce6fd66c0e46 | Open vSwitch agent | compute-0 | None | :-) | UP | neutron-openvswitch-agent |
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
controller-1:~$

# lock/unlock compute-1
[2019-07-24 09:48:06,631] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-07-24 09:48:38,121] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin -...


Revision history for this message
Joseph Richard (josephrichard) wrote :

In the lab I looked at where this was reported to occur, the l3 agent wasn't reporting due to [1], so this was closed as a duplicate of that issue.

In that case, you can see this error [2] in the neutron-l3-agent logs.

If you don't see that error, then it is not the same issue that this was closed as a duplicate of, and a new Launchpad bug should be opened for the new occurrence of instances failing to launch.

[1] https://bugs.launchpad.net/starlingx/+bug/1835807
[2] 2019-07-31 16:11:03.188 431 ERROR neutron_lib.rpc [req-5d4ced1d-dc2a-4eea-bec4-f939b399f285 - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 29 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 0829996d71aa48caa8e8b3a711be173e
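
As a minimal triage sketch (assuming the neutron-l3-agent runs as a pod in the openstack namespace, as it does for stx-openstack; the pod name below is a placeholder), the check for error [2] can be done with:

kubectl -n openstack get pods | grep neutron-l3-agent
kubectl -n openstack logs <neutron-l3-agent-pod> | grep "Timeout in RPC method get_host_ha_router_count"

If the timeout is present, the error text itself points at the rpc_response_timeout option (an oslo.messaging setting in neutron.conf); in stx-openstack such a change would normally be applied through a neutron helm override rather than by editing the configuration inside the pod.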

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix to match the duplicate LP.

Changed in starlingx:
status: Triaged → Won't Fix
Peng Peng (ppeng)
tags: removed: stx.retestneeded