_determine_version_cap fails with MessagingTimeout when starting nova-compute

Bug #1631918 reported by John Garbutt
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

On a fresh deployment, there are issues when starting up nova-compute, before nova-conductor has started responding to RPC requests.

The first is a MessagingTimeout in the _determine_version_cap call, which is triggered when the ComputeManager class is created.

This causes the process to exit, but it doesn't seem to fully shut down.

This seems to happen only when CONF.upgrade_levels.compute = "auto".

This was spotted in this OSA change:
https://review.openstack.org/#/c/367752
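
For context, a rough sketch of why "auto" triggers the fatal RPC round trip. This is illustrative only, not nova's literal code, and it assumes the lookup behind _determine_version_cap is objects.Service.get_minimum_version, which matches the arguments visible in the traceback further down:

    # Illustrative sketch only -- requires a configured nova environment.
    # With upgrade_levels.compute = "auto", something along these lines runs
    # while the compute service object is being constructed.
    from nova import context
    from nova import objects

    objects.register_all()  # normally done by the service entry point

    def determine_version_cap_sketch():
        # This object call is indirected over RPC to nova-conductor, so it
        # blocks waiting for a reply and raises MessagingTimeout if the
        # conductor is not responding yet.
        return objects.Service.get_minimum_version(
            context.get_admin_context(), 'nova-compute')

Because that lookup runs while the service is being built, a conductor that is not yet answering turns into a fatal MessagingTimeout during nova-compute startup.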

Changed in nova:
assignee: nobody → John Garbutt (johngarbutt)
importance: Undecided → Medium
status: New → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/384428

Revision history for this message
John Garbutt (johngarbutt) wrote :

I think this is the key chain of events:
2016-10-08 11:05:41.202 16059 ERROR nova Traceback (most recent call last):
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/bin/nova-compute", line 10, in <module>
2016-10-08 11:05:41.202 16059 ERROR nova sys.exit(main())
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/cmd/compute.py", line 63, in main
2016-10-08 11:05:41.202 16059 ERROR nova db_allowed=CONF.conductor.use_local)
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/service.py", line 218, in create
2016-10-08 11:05:41.202 16059 ERROR nova db_allowed=db_allowed)
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/service.py", line 92, in __init__
<snip>
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 344, in _determine_version_cap
2016-10-08 11:05:41.202 16059 ERROR nova context.get_admin_context(), 'nova-compute')
<snip>
2016-10-08 11:05:41.202 16059 ERROR nova MessagingTimeout: Timed out waiting for a reply to message ID 85abb28f3095431eaa819b6fedb2f932
2016-10-08 11:05:43.622 17413 WARNING nova.virt.libvirt.driver [req-b6d46bac-0d67-48dd-83fc-8d684fb1cf3c - - - - -] Cannot update service status on host "ubuntu-trusty-osic-cloud1-s3500-4816864" since it is not registered.

The first failure appears to happen while setting up the RPC API: we try to make a DB call, it fails, and it seems we never try again:
https://github.com/openstack/nova/blob/master/nova/compute/rpcapi.py#L343
We need to handle failures during the above call better.
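
One possible shape for handling this better, as a hedged sketch rather than a proposed patch (the helper name and the retry/delay values are made up for illustration):

    import time

    import oslo_messaging

    def lookup_with_retry(lookup, retries=5, delay=10):
        # 'lookup' would be the minimum-service-version query that backs
        # _determine_version_cap; retrying gives nova-conductor a chance to
        # come up instead of nova-compute dying on the first timeout.
        for attempt in range(retries):
            try:
                return lookup()
            except oslo_messaging.MessagingTimeout:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)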

The second one is more of a worry. I can't quite tell who restarted nova-compute, or whether Nova has some kind of failed restart going on here; I need to dig into that. It looks like we fail on the first service update, and then all future service update attempts fail because the service doesn't exist.

In the working ones, you see:
2016-10-10 02:27:48.015 17473 WARNING nova.virt.libvirt.driver [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Cannot update service status on host "ubuntu-trusty-osic-cloud1-s3500-4823960" since it is not registered.
2016-10-10 02:27:48.106 17473 ERROR nova.compute.manager [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] No compute node record for host ubuntu-trusty-osic-cloud1-s3500-4823960
2016-10-10 02:27:48.109 17473 WARNING nova.compute.monitors [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2016-10-10 02:27:48.168 17473 WARNING nova.compute.resource_tracker [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] No compute node record for ubuntu-trusty-osic-cloud1-s3500-4823960:ubuntu-trusty-osic-cloud1-s3500-4823960
2016-10-10 02:27:48.191 17473 INFO nova.compute.resource_tracker [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Compute_service record created for ubuntu-trusty-osic-cloud1-s3500-4823960:u...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/384428
Reason: This is a bad idea; the service does die correctly, and it was the init script that caused it to come back to life.

Revision history for this message
John Garbutt (johngarbutt) wrote :

Turns out, the init script is restarting the process after 150 seconds of failure.

This races with the tempest retries, and sometimes means the tests pass (tempest gets retried 3 times, I think).

So nova-compute dies when nova-conductor is not running, which is OK. It might be better if we retried a few times once we are up, but it seems like that would need a bit of rework.
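
To illustrate the "retry a few times once we are up" idea, a hedged sketch (the names here are invented, this is not nova code): instead of resolving the version cap in __init__, where a conductor outage is fatal, resolve it lazily and keep retrying on later calls.

    import oslo_messaging

    class LazyVersionCap(object):
        """Illustrative only: defer the version-cap lookup until it succeeds."""

        def __init__(self, lookup):
            self._lookup = lookup  # e.g. the minimum-service-version query
            self._cap = None

        def get(self):
            # Return None (no cap yet) until conductor answers; callers can
            # retry on their next RPC instead of crashing during startup.
            if self._cap is None:
                try:
                    self._cap = self._lookup()
                except oslo_messaging.MessagingTimeout:
                    return None
            return self._cap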

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → New
assignee: John Garbutt (johngarbutt) → nobody
Revision history for this message
Sean Dague (sdague) wrote :

The bug report seems to indicate this isn't really an issue any more?

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired