nova-live-migration and nova-grenade-multinode fail due to n-cpu restarting slowly after being reconfigured for ceph
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | High | Lee Yarwood | |
| Pike | Fix Released | Medium | Lee Yarwood | |
| Queens | Fix Released | Medium | Lee Yarwood | |
| Rocky | Fix Released | Medium | Lee Yarwood | |
| Stein | Fix Released | Medium | Lee Yarwood | |
| Train | Fix Released | Medium | Lee Yarwood | |
Bug Description
Description
===========
$subject: it appears the current check (grepping for active n-cpu processes) isn't sufficient; we actually need to wait for the services to report as UP before starting to run Tempest.
In the following we can see Tempest starting at 2020-03-13 13:01:19.528 while n-cpu within the instance isn't marked as UP for another ~20 seconds:
https:/
I've only seen this on stable/pike at present but it could potentially hit all branches with slow enough CI nodes.
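A minimal sketch of the kind of wait the job needs, polling until every nova-compute service reports up rather than just grepping for the process. The helper names (`services_up`, `wait_for_computes`) are hypothetical, and it assumes the JSON output shape of `openstack compute service list -f json` (records with `Binary` and `State` keys); it is not the actual fix as merged:

```python
import json
import subprocess
import time


def services_up(listing):
    """Return True if every nova-compute record in the parsed
    `openstack compute service list -f json` output reports State "up"."""
    computes = [s for s in listing if s.get("Binary") == "nova-compute"]
    # No compute services visible yet also counts as "not up".
    return bool(computes) and all(s.get("State") == "up" for s in computes)


def wait_for_computes(timeout=120, interval=5):
    """Poll the services API until all computes are up, or time out.

    Hypothetical helper: a CI hook would call this after restarting
    n-cpu and before handing control to Tempest.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ["openstack", "compute", "service", "list", "-f", "json"])
        if services_up(json.loads(out)):
            return True
        time.sleep(interval)
    return False
```

The point of polling the API (rather than `pgrep`) is that the process existing does not mean `init_host` has finished and the service has sent a heartbeat; only the service record flipping to UP guarantees the compute is actually ready to accept builds.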
Steps to reproduce
==================
Run nova-live-migration on slow CI nodes.
Expected result
===============
nova/tests/
Actual result
=============
nova/tests/
Environment
===========
1. Exact version of OpenStack you are running. See the following
list for all releases: http://
stable/pike, but it could be present on other branches with slow enough CI nodes.
2. Which hypervisor did you use?
(For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
What's the version of that?
Libvirt / KVM.
3. Which storage type did you use?
(For example: Ceph, LVM, GPFS, ...)
What's the version of that?
N/A
4. Which networking type did you use?
(For example: nova-network, Neutron with OpenVSwitch, ...)
N/A
Logs & Configs
==============
Mar 13 13:01:39.170201 ubuntu-
Mar 13 13:01:39.255008 ubuntu-
Mar 13 13:01:39.322508 ubuntu-
Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
tags: added: live-migration testing
Changed in nova:
importance: Undecided → Medium
tags: added: gate-failure
I think I just hit this on the master branch on the nova-grenade-multinode job [1].
The error in job-output.txt was:
tempest.api.compute.admin.test_live_migration_negative.LiveMigrationNegativeTest.test_invalid_host_for_migration [7.432917s] ... FAILED
and
tempest.exceptions.BuildErrorException: Server e6d27a14-ee54-47b0-b44e-3d8db0d99e85 failed to build and is in ERROR status
I traced the server and found it was scheduled in screen-n-sch.txt:
Mar 18 21:02:56.722990 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.manager [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Starting to schedule for instances: ['e6d27a14-ee54-47b0-b44e-3d8db0d99e85'] {{(pid=16667) select_destinations /opt/stack/new/nova/nova/scheduler/manager.py:134}}
Mar 18 21:02:57.031615 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.utils [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Attempting to claim resources in the placement API for instance e6d27a14-ee54-47b0-b44e-3d8db0d99e85 {{(pid=16667) claim_resources /opt/stack/new/nova/nova/scheduler/utils.py:1175}}
Mar 18 21:02:57.490996 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.filter_scheduler [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Selected host: (ubuntu-bionic-ovh-gra1-0015309367, ubuntu-bionic-ovh-gra1-0015309367) ram: 7273MB disk: 51200MB io_ops: 0 instances: 0 {{(pid=16667) _consume_selected_host /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:354}}
But when I went to find it in nova-compute, I found this in screen-n-cpu.txt on the subnode:
Mar 18 21:03:01.566901 ubuntu-bionic-ovh-gra1-0015309367 nova-compute[3783]: DEBUG nova.compute.manager [None req-001a485d-3f4a-43fa-8719-77d0f433b609 None None] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted /opt/stack/old/nova/nova/compute/manager.py:1441}}
The server never got a chance to finish building because nova-compute was starting up (init_host) (!!) right in the middle of the build.
Looking back at job-output.txt, I see the last messages were about checking and restarting nova-compute:
2020-03-18 21:02:32.510701 | primary | 2020-03-18 21:02:32.510 | check compute processes before restart
So it's trying to run the test before nova-compute has finished starting and come back up.
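The startup behavior named in the log, `_error_out_instances_whose_build_was_interrupted`, can be sketched roughly as below. This is a simplified illustration of the race, not nova's actual code: on `init_host`, any instance still in the building state on this host had its spawn interrupted by the restart and can never finish, so it is moved to ERROR:

```python
BUILDING = "building"
ERROR = "error"


def error_out_interrupted_builds(instances):
    """Simplified sketch of the init_host cleanup seen in the log.

    Any instance caught mid-build when the service restarted is set
    to ERROR, which is exactly what happened to the Tempest server
    scheduled while n-cpu was coming back up.
    """
    for inst in instances:
        if inst["vm_state"] == BUILDING:
            # Spawn was interrupted before instance_claim completed;
            # the build cannot resume, so mark the instance failed.
            inst["vm_state"] = ERROR
    return instances
```

This is why a grep-based process check is not enough: the process exists and is busy erroring out interrupted builds while Tempest is already creating servers against it.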
[1] https://zuul.opendev.org/t/openstack/build/2caa70137d4f438b90cdd679d99ebe05