After active controller reboot, VM boot up failed with error "Failed to allocate the network(s), not rescheduling"

Bug #1836928 reported by Peng Peng
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Joseph Richard
Milestone:

Bug Description

Brief Description
-----------------
The stx-openstack application applied successfully, but subsequent VM boot-ups failed with the error:
"Failed to allocate the network(s), not rescheduling"

It seems different from https://bugs.launchpad.net/starlingx/+bug/1833622
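
A quick way to retrieve the full fault for a failed boot is to show only the fault column of the server (a minimal sketch using standard openstackclient output options; <instance-uuid> is a placeholder for the failed instance):

openstack server show <instance-uuid> -f value -c fault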

Severity
--------
Critical

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_63-66

Branch/Pull Time/Commit
-----------------------
stx master as of 20190717T013000Z

Last Pass
---------
20190712T013000Z

Timestamp/Logs
--------------
[2019-07-17 09:32:46,241] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-17 09:32:47,775] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+

[2019-07-17 09:33:50,719] 301 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=0ba26c17-f20b-4653-b98c-3cccedbd758e --key-name=keypair-tenant1 --poll --flavor=0e193dd4-87ba-4cb9-a6ac-d2dd6cf55aef --nic net-id=f6712dfb-9cba-4d3b-92f4-6cd43c64cd8d --nic net-id=02cf1817-9255-4eb4-a3d9-8559f6a395c1 tenant1-vm-1'
[2019-07-17 09:39:17,028] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | hztD5j7XeXbF |
| config_drive | |
| created | 2019-07-17T09:33:54Z |
| description | - |
| flavor:disk | 2 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"hw:mem_page_size": "2048"} |
| flavor:original_name | flavor-default-size2 |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| id | 1f4bafed-2ef4-4044-9486-5223de2f3cdb |
| image | Attempt to boot from volume - no image supplied |
| key_name | keypair-tenant1 |
| locked | False |
| metadata | {} |
| name | tenant1-vm-1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| server_groups | [] |
| status | BUILD |
| tags | [] |
| tenant_id | d8f07a22e9c3411782c491a5f1bcf9d8 |
| trusted_image_certificates | - |
| updated | 2019-07-17T09:33:54Z |
| user_id | 4c13574c55d944b3bd3258169ebe4c06 |
+--------------------------------------+-------------------------------------------------+

Server building... 0% complete
Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):

[2019-07-17 09:39:17,133] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show 1f4bafed-2ef4-4044-9486-5223de2f3cdb'

| created | 2019-07-17T09:33:54Z |
| fault | {u'message': u'Build of instance 1f4bafed-2ef4-4044-9486-5223de2f3cdb aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1984, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2322, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-07-17T09:39:14Z'} |
| flavor | flavor-default-size2 (0e193dd4-87ba-4cb9-a6ac-d2dd6cf55aef)

[2019-07-17 09:45:56,018] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-17 09:45:57,479] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | remove-failed | operation aborted, check logs for detail |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------------+------------------------------------------+

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
description: updated
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, can you please provide more info on why you believe this bug is different from https://bugs.launchpad.net/starlingx/+bug/1833622 ? Thanks

description: updated
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

1833622 was marked as a duplicate of 1833609 - "stx-openstack application stuck at processing chart: osh-openstack-ceph-rgw after unlock standby controller".

But when this bug (1836928) happened, the stx-openstack application status was applied.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, please add a new set of logs from the new occurrence.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_63_66
Load: 20190718T013000Z

From 09:29 to 11:29 the stx-openstack application status was applied,
but all VM boot-up attempts during this period failed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@ppeng, were there any node reboots (controller or compute) triggered by the sanity suite prior to the VM boot-up failures? Were there any successful VM launches in this sanity run?

tags: added: stx.sanity
Revision history for this message
Peng Peng (ppeng) wrote :

Yes, there was an active controller reboot before this issue happened, and there was a successful VM boot-up prior to the reboot. The detailed timestamps are as follows:

09:30 VM boot-up succeeded
09:37 active controller-1 rebooted
09:45 controller-1 fully up
09:46 VM boot-up failed as described
      (all subsequent VM boot-up attempts also failed)
11:29 last VM boot-up failed

The test suite checked application-list throughout this period; the stx-openstack application status consistently showed applied/completed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

If this issue started happening after a node reboot, then you are hitting:
https://bugs.launchpad.net/starlingx/+bug/1835807

Marking as a duplicate.

Changed in starlingx:
importance: Undecided → High
status: Incomplete → Triaged
tags: added: stx.2.0 stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
summary: - VM boot up failed, fault msg "Failed to allocate the network(s), not
- rescheduling"
+ After active controller reboot, VM boot up failed with error "Failed to
+ allocate the network(s), not rescheduling"
Revision history for this message
Yang Liu (yliu12) wrote :

In the latest sanity run, it does not look like this is caused by the force reboot of the controller.
Here are the timestamps:
# Force rebooted active controller controller-1
[2019-07-24 09:35:04,259] 180 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1

# Launched two vms and swact active controller
[2019-07-24 09:44:38,117] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

# Existing vms are reachable and all network agents are up after above operations
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
| 21085941-338c-47f1-b438-8ce2c5d080a9 | NIC Switch agent | compute-1 | None | :-) | UP | neutron-sriov-nic-agent |
| 4c32d6a6-f35b-471d-a505-0a6324184d3b | Metadata agent | compute-1 | None | :-) | UP | neutron-metadata-agent |
| 5129133a-422a-40e4-b095-d8407ae03a49 | Metadata agent | compute-0 | None | :-) | UP | neutron-metadata-agent |
| 6c171c45-186c-46fa-80e0-dcbe4f91d624 | NIC Switch agent | compute-0 | None | :-) | UP | neutron-sriov-nic-agent |
| 876e46cf-3e36-4782-9537-72def87df324 | L3 agent | compute-0 | nova | :-) | UP | neutron-l3-agent |
| 9250f08b-51f8-4775-abbc-f6dbd6cbdf4a | DHCP agent | compute-1 | nova | :-) | UP | neutron-dhcp-agent |
| 94d53080-ef97-4405-a4b9-a5bede967423 | Open vSwitch agent | compute-1 | None | :-) | UP | neutron-openvswitch-agent |
| a4f454c2-ab64-405c-827e-fa719f39794e | DHCP agent | compute-0 | nova | :-) | UP | neutron-dhcp-agent |
| d2cc032e-3221-4e90-9179-4dde96ee7618 | L3 agent | compute-1 | nova | :-) | UP | neutron-l3-agent |
| f661177b-a1ef-4cb0-aa88-ce6fd66c0e46 | Open vSwitch agent | compute-0 | None | :-) | UP | neutron-openvswitch-agent |
+--------------------------------------+--------------------+-----------+-------------------+-------+-------+---------------------------+
controller-1:~$

# lock/unlock compute-1
[2019-07-24 09:48:06,631] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-1'

[2019-07-24 09:48:38,121] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin -...


Revision history for this message
Joseph Richard (josephrichard) wrote :

In the lab I looked at where this was reported to occur, the l3 agent wasn't reporting due to [1], so this was closed as a duplicate of that issue.

In that case, you can see this error [2] in the neutron-l3-agent logs.

If you don't see that error, then it is not the same issue that this was closed as a duplicate of, and a new Launchpad bug should be opened for the new occurrence of instances failing to launch.

[1] https://bugs.launchpad.net/starlingx/+bug/1835807
[2] 2019-07-31 16:11:03.188 431 ERROR neutron_lib.rpc [req-5d4ced1d-dc2a-4eea-bec4-f939b399f285 - - - - -] Timeout in RPC method get_host_ha_router_count. Waiting for 29 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 0829996d71aa48caa8e8b3a711be173e
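
As a minimal triage sketch (assuming the neutron-l3-agent runs as a pod in the openstack namespace, as it does for stx-openstack; the pod name below is a placeholder), the check for error [2] can be done with:

kubectl -n openstack get pods | grep neutron-l3-agent
kubectl -n openstack logs <neutron-l3-agent-pod> | grep "Timeout in RPC method get_host_ha_router_count"

If the timeout is present, the error text itself points at the rpc_response_timeout option (an oslo.messaging setting in neutron.conf); in stx-openstack such a change would normally be applied through a neutron helm override rather than by editing the configuration inside the pod.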

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix to match the duplicate LP.

Changed in starlingx:
status: Triaged → Won't Fix
Peng Peng (ppeng)
tags: removed: stx.retestneeded