VM boot-up ERROR: "Failed to allocate the network(s), not rescheduling"

Bug #1833622 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
Failed to boot an ubuntu_14 VM with virtio vifs. The fault message shows: "Failed to allocate the network(s), not rescheduling."

Severity
--------
Critical

Steps to Reproduce
------------------
Create a flavor with a dedicated CPU policy
Create a volume from the ubuntu_14 image
Boot an ubuntu_14 VM with virtio vifs from the above flavor and volume (see the CLI sketch below)

TC-name: test_ping_vms.py::test_ping_between_two_vms[ubuntu_14-virtio-virtio]
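For reference, a minimal CLI sketch of the three steps above (assuming tenant credentials are already sourced; the flavor, volume, and network names below are illustrative placeholders, not the ones used in this run):

# Step 1: create a flavor with a dedicated CPU policy
openstack flavor create --vcpus 2 --ram 1024 --disk 2 \
  --property hw:cpu_policy=dedicated flavor-dedicated

# Step 2: create a bootable volume from the ubuntu_14 image
openstack volume create --image ubuntu_14 --size 10 vol-ubuntu14

# Step 3: boot the VM from that flavor and volume with a virtio NIC
# (virtio is the default vif model, so no extra NIC option is needed)
openstack server create --flavor flavor-dedicated --volume vol-ubuntu14 \
  --nic net-id=<tenant-net-uuid> tenant1-ubuntu14-test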

Expected Behavior
------------------
VM boots up successfully

Actual Behavior
----------------
VM fails to boot

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
stx master as of 20190620T013000Z

Last Pass
---------
Lab: WCP_113_121
Load: 20190620T013000Z

Timestamp/Logs
--------------
[2019-06-20 18:44:25,213] 268 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=c3078008-36e8-4fad-8924-4e45e48d4bee --key-name=keypair-tenant1 --user-data=/home/sysadmin//userdataubuntu_1nic_cloud_init_if_conf.sh --poll --flavor=313e7a49-133d-4575-9642-f77dec94d702 --nic net-id=6fa3c05a-b858-42f9-a228-fd8005f8b52e tenant1-ubuntu_14_vifs-3'
[2019-06-20 18:49:40,364] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |

Server building... 0% complete
Server building... 0% complete
Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):
controller-0:~$
[2019-06-20 18:49:40,365] 268 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-06-20 18:49:40,468] 387 DEBUG MainThread ssh.expect :: Output:
1
controller-0:~$

[2019-06-20 18:49:40,470] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show f47ce471-f0ca-4dc3-93b6-53ff6cf9ba4b'
[2019-06-20 18:49:42,517] 387 DEBUG MainThread ssh.expect :: Output:
-+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-00000016 |
| created | 2019-06-20T18:44:27Z |
| fault | {u'message': u'Build of instance f47ce471-f0ca-4dc3-93b6-53ff6cf9ba4b aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 1963, in _do_build_and_run_instance\n filter_properties, request_spec)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 2301, in _build_and_run_instance\n reason=msg)\n', u'created': u'2019-06-20T18:49:36Z'} |
| flavor | dedicated (313e7a49-133d-4575-9642-f77dec94d702)

Test Activity
-------------
Sanity

zhipeng liu (zhipengs)
Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From build.info in the log, I can see the info below.
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190620T013000Z"

This daily build has not started the placement pod, which will impact VM creation.
This may be a duplicate of bug 1833497.
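A quick way to verify this on a live system would be something like the following (a sketch; it assumes kubectl access on the active controller and that the placement chart runs in the openstack namespace, as it does here):

# list the placement pods and confirm they are Running
kubectl -n openstack get pods | grep placement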

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Zhipeng, I thought the placement pod not starting is only an issue when vswitch_type is set to none. This is confirmed by the sanity report Maria sent on 2019-06-20.

@Peng, can you confirm the vswitch type for this occurrence?
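For reference, a sketch of how the vswitch type can be checked from the active controller (this assumes the vswitch_type field is reported by 'system show', which may vary by release):

# load the platform credentials, then query the system configuration
source /etc/platform/openrc
system show | grep vswitch_type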

tags: added: stx.containers stx.sanity
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

VSWITCH_TYPE : ovs-dpdk

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_71_75
Load: 20190623T233000Z

VSWITCH_TYPE : ovs-dpdk

New log attached

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Given that the issue is reproduced again with a more recent load, this proves that this is not a duplicate of https://bugs.launchpad.net/starlingx/+bug/1833497 as the fix for that bug was merged on 2019-06-20.

Zhipeng, please review the provided logs to determine why nova is returning a failure on the network allocation. If you need help from the networking team, please contact Forrest Zhao.

Changed in starlingx:
status: Incomplete → New
importance: Undecided → High
status: New → Triaged
Revision history for this message
Peng Peng (ppeng) wrote :

The last successful VM boot-up was on:
Lab: WCP_71_75
Load: 20190612T013000Z

Revision history for this message
zhipeng liu (zhipengs) wrote :

OK, I will check further and also involve Forrest's team.

Zhipeng

Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Is it possible to create a normal VM on your multi-node system? I notice that Nova can't connect to the database in both ALL_NODES_20190620.210149.tar and ALL_NODES_20190624.161925.tar. If Nova really can't connect to the database, no VM can be created.

You can find the logs at the following paths:
ALL_NODES_20190624.161925/controller-0_20190624.161925/var/log/containers/nova-scheduler-755c98cf54-bqhqg_openstack_nova-scheduler-e5a2a9a3b58aa25e1684c73caf9525af925e342396bd668a2c03e30f5173aac6.log
ALL_NODES_20190624.161925/controller-0_20190624.161925/var/log/containers/nova-conductor-6787ff97bc-qrklj_openstack_nova-conductor-94e02872794725b867494c9872c98a525e6bcdc196e13db2092ed138f61f7e01.log
ALL_NODES_20190620.210149/controller-1_20190620.210149/var/log/containers/nova-conductor-78c98f5d74-mbgxt_openstack_nova-conductor-9e807cb9e82a0def0ac6dcba6d19525d875290f265a500bfd6a72aa29e6714e7.log
ALL_NODES_20190620.210149/controller-1_20190620.210149/var/log/containers/nova-scheduler-8489d97b99-wls9s_openstack_nova-scheduler-bfc78d486a114c955b38fc8e8deaeb9de40fed9e5cc459ac7a4d6c5c04b05708.log

{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines [req-70252872-3d13-472c-bfc7-a79e30e2ae48 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)\n","stream":"stdout","time":"2019-06-20T19:59:00.613656542Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):\n","stream":"stdout","time":"2019-06-20T19:59:00.613685821Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py\", line 73, in _connect_ping_listener\n","stream":"stdout","time":"2019-06-20T19:59:00.613689557Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))\n","stream":"stdout","time":"2019-06-20T19:59:00.613692685Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 912, in scalar\n","stream":"stdout","time":"2019-06-20T19:59:00.613695482Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()\n","stream":"stdout","time":"2019-06-20T19:59:00.613698414Z"}
{"log":"2019-06-20 19:59:00.612 1 ERROR oslo_db.sqlalchemy.engines File \"/var/lib/openstack/lib/python2.7/site-packages/sqlalchemy/engine/base.py\", line 980, in execute\n","stream":"stdout","time":"2019-06-20T19:59:00.613701252Z"}

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Chenjie,
The sanity test already includes VM creation, and I also tested VM creation last week; it should be OK.

Zhipeng

Revision history for this message
Peng Peng (ppeng) wrote :

The first VM creation succeeded during the whole test suite run, but the rest of the VM creations all failed.

[2019-06-20 17:52:41,191] 268 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --boot-volume=ed3cef11-0f1f-4005-b8aa-f4acc26df0b8 --key-name=keypair-tenant1 --poll --flavor=e5282554-4e4f-483b-9cd8-4491a6364a4c --nic net-id=6fa3c05a-b858-42f9-a228-fd8005f8b52e --nic net-id=ccaedbe7-495f-4aa1-a05c-c0d1bf611a1a tenant1-vm-1'
[2019-06-20 17:53:00,223] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | m8XPHUJi5d3i |
| config_drive | |
| created | 2019-06-20T17:52:44Z |
| description | - |
| flavor:disk | 2 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"hw:mem_page_size": "2048"} |
| flavor:original_name | flavor-default-size2 |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| id | 1174d120-8b62-4fa9-af24-08734d1bc7f2 |
| image | Attempt to boot from volume - no image supplied |
| key_name | keypair-tenant1 ...


Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Was the first VM you created the ubuntu_14 VM with virtio vifs? For the remaining VMs, what kind of VMs did you try to create? I'm wondering whether this is related to "creating an ubuntu_14 VM with virtio vifs" or not.

Revision history for this message
Peng Peng (ppeng) wrote :

The test suite tried to launch 11 VMs. The first VM launch used the tis-centos-guest image and succeeded. The following 8 VM launches failed. The images included:
tis-centos-guest vm with virtio vifs
ubuntu_14 vm with virtio vifs

The last two VM launches also succeeded.

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Peng,
I need your confirmation on the following:

The first VM launched with tis-centos-guest successfully.
For the following 8 VMs, which image was used, ubuntu_14?
For the last 2 VMs, which image was used?

Thanks!
Zhipeng

Revision history for this message
Peng Peng (ppeng) wrote :

The images varied; please search "RegionOne boot" in the attached execution log for details.
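For example, something like this over the attached execution log (the TIS_AUTOMATION.log file name is taken from a later comment; adjust to the actual attachment name):

# list every boot command issued during the run, including the flavor/volume/net IDs used
grep "RegionOne boot" TIS_AUTOMATION.log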

Ghada Khalil (gkhalil)
tags: added: stx.2.0
Revision history for this message
ChenjieXu (midone) wrote :

Hi Pengpeng,

Based on the log TIS_AUTOMATION.log, this bug is caused by "system application-apply stx-openstack" getting stuck. It is not caused by booting an ubuntu_14 VM with virtio vifs.

The log can be summarized as follows:
1. The first VM was booted successfully.

2. stx-openstack was re-applied for some reason, but unfortunately the process got stuck at "processing chart: osh-openstack-ceph-rgw, overall completion: 44.0%".

[2019-06-20 18:43:13,476] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-06-20 18:43:14,997] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | upload-failed | Upload of application hello-kitty (1.0) failed: Command '['helm-upload', 'starlingx', u'/scratch/apps/hello-kitty/1.0/charts |
| | | | | | /hello-kitty.tgz']' returned non-zero exit status 1 |
| | | | | | |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack. | applying | processing chart: osh-openstack-ceph-rgw, overall completion: 44.0% |
| | versioned | | yaml | | |
| | | | | | ...
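For reference, the apply status can be monitored and, once the underlying issue is resolved, re-triggered with the standard StarlingX commands (a sketch; whether a plain re-apply is safe here depends on the root cause):

# watch the overall progress of the application
system application-list

# re-trigger the apply after the blocking condition is cleared
system application-apply stx-openstack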

Revision history for this message
ChenjieXu (midone) wrote :

Hi all,

An email has been sent to Tee Ngo, who is responsible for the following bug, asking him to take a look at this one. That bug is caused by rebooting the host before system application-apply completes successfully:
https://bugs.launchpad.net/starlingx/+bug/1833323

I think we need a better mechanism to deal with the failures of system application-apply.

Revision history for this message
ChenjieXu (midone) wrote :

Hi all,

According to Shuicheng, this bug may be a duplicate of the following bug:
https://bugs.launchpad.net/starlingx/+bug/1833609

Revision history for this message
ChenjieXu (midone) wrote :

Hi all,

The following logs are similar to those in the bug (https://bugs.launchpad.net/starlingx/+bug/1833609):

2019-06-24 10:51:58.243 36 INFO armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Waiting for resource type=pod, namespace=openstack labels=release_group=osh-openstack-ceph-rgw required=True for 1800s
2019-06-24 10:51:58.243 36 DEBUG armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Starting to wait on: namespace=openstack, resource type=pod, label_selector=(release_group=osh-openstack-ceph-rgw), timeout=1800 _watch_resource_completions /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:362

You can find the above logs in:
ALL_NODES_20190624.161925/controller-1_20190624.161925/var/log/armada/stx-openstack-apply.log
ALL_NODES_20190620.210149/controller-1_20190620.210149/var/log/armada/stx-openstack-apply.log
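The same wait condition can also be inspected directly on the cluster; a sketch using the label selector from the Armada log above:

# list the pods Armada is waiting on and check why any of them are not Ready
kubectl -n openstack get pods -l release_group=osh-openstack-ceph-rgw
kubectl -n openstack describe pods -l release_group=osh-openstack-ceph-rgw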

Revision history for this message
ChenjieXu (midone) wrote :

Hi all,

Shuicheng has confirmed that this bug should be a duplicate of the bug (https://bugs.launchpad.net/starlingx/+bug/1833609), per the following log:

2019-06-24 10:51:58.243 36 INFO armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Waiting for resource type=pod, namespace=openstack labels=release_group=osh-openstack-ceph-rgw required=True for 1800s
2019-06-24 10:51:58.243 36 DEBUG armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Starting to wait on: namespace=openstack, resource type=pod, label_selector=(release_group=osh-openstack-ceph-rgw), timeout=1800 _watch_resource_completions /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:362

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug is fixed by: https://review.opendev.org/670445
Merged on 2019-07-17

Marking as Fix Released

Changed in starlingx:
status: Triaged → Fix Released