_determine_version_cap fails with MessagingTimeout when starting nova-compute

Bug #1631918 reported by John Garbutt
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

On a fresh deployment, there are issues when starting up nova-compute, before nova-conductor has started responding to RPC requests.

The first is a MessagingTimeout in the _determine_version_cap call, which is triggered when the ComputeManager class is created.

This causes the process to exit, but it doesn't seem to fully shut down.

This seems to happen only when CONF.upgrade_levels.compute = "auto".

This was spotted in this OSA change:
https://review.openstack.org/#/c/367752
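
For context, a rough sketch of why "auto" triggers the fatal RPC round trip. This is illustrative only, not nova's literal code, and it assumes the lookup behind _determine_version_cap is objects.Service.get_minimum_version, which matches the arguments visible in the traceback further down:

    # Illustrative sketch only -- requires a configured nova environment.
    # With upgrade_levels.compute = "auto", something along these lines runs
    # while the compute service object is being constructed.
    from nova import context
    from nova import objects

    objects.register_all()  # normally done by the service entry point

    def determine_version_cap_sketch():
        # This object call is indirected over RPC to nova-conductor, so it
        # blocks waiting for a reply and raises MessagingTimeout if the
        # conductor is not responding yet.
        return objects.Service.get_minimum_version(
            context.get_admin_context(), 'nova-compute')

Because that lookup runs while the service is being built, a conductor that is not yet answering turns into a fatal MessagingTimeout during nova-compute startup.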

Changed in nova:
assignee: nobody → John Garbutt (johngarbutt)
importance: Undecided → Medium
status: New → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/384428

Revision history for this message
John Garbutt (johngarbutt) wrote :

I think this is the key chain of events:
2016-10-08 11:05:41.202 16059 ERROR nova Traceback (most recent call last):
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/bin/nova-compute", line 10, in <module>
2016-10-08 11:05:41.202 16059 ERROR nova sys.exit(main())
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/cmd/compute.py", line 63, in main
2016-10-08 11:05:41.202 16059 ERROR nova db_allowed=CONF.conductor.use_local)
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/service.py", line 218, in create
2016-10-08 11:05:41.202 16059 ERROR nova db_allowed=db_allowed)
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/service.py", line 92, in __init__
<snip>
2016-10-08 11:05:41.202 16059 ERROR nova File "/openstack/venvs/nova-testing/local/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 344, in _determine_version_cap
2016-10-08 11:05:41.202 16059 ERROR nova context.get_admin_context(), 'nova-compute')
<snip>
2016-10-08 11:05:41.202 16059 ERROR nova MessagingTimeout: Timed out waiting for a reply to message ID 85abb28f3095431eaa819b6fedb2f932
2016-10-08 11:05:43.622 17413 WARNING nova.virt.libvirt.driver [req-b6d46bac-0d67-48dd-83fc-8d684fb1cf3c - - - - -] Cannot update service status on host "ubuntu-trusty-osic-cloud1-s3500-4816864" since it is not registered.

The first failure appears to happen while setting up the RPC API: we try to make a DB call, it fails, and it seems we never try again:
https://github.com/openstack/nova/blob/master/nova/compute/rpcapi.py#L343
We need to handle failures during the above call better.
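
One possible shape for handling this better, as a hedged sketch rather than a proposed patch (the helper name and the retry/delay values are made up for illustration):

    import time

    import oslo_messaging

    def lookup_with_retry(lookup, retries=5, delay=10):
        # 'lookup' would be the minimum-service-version query that backs
        # _determine_version_cap; retrying gives nova-conductor a chance to
        # come up instead of nova-compute dying on the first timeout.
        for attempt in range(retries):
            try:
                return lookup()
            except oslo_messaging.MessagingTimeout:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)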

The second one is more of a worry. I can't quite tell who restarted nova-compute, or whether Nova has some kind of failed restart going on here; I need to dig into that. It looks like we fail on the first service update, and then all future service update attempts fail because the service doesn't exist.

In the working ones, you see:
2016-10-10 02:27:48.015 17473 WARNING nova.virt.libvirt.driver [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Cannot update service status on host "ubuntu-trusty-osic-cloud1-s3500-4823960" since it is not registered.
2016-10-10 02:27:48.106 17473 ERROR nova.compute.manager [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] No compute node record for host ubuntu-trusty-osic-cloud1-s3500-4823960
2016-10-10 02:27:48.109 17473 WARNING nova.compute.monitors [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2016-10-10 02:27:48.168 17473 WARNING nova.compute.resource_tracker [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] No compute node record for ubuntu-trusty-osic-cloud1-s3500-4823960:ubuntu-trusty-osic-cloud1-s3500-4823960
2016-10-10 02:27:48.191 17473 INFO nova.compute.resource_tracker [req-cd630ce1-c7b8-402c-8ad9-4303b3f737c9 - - - - -] Compute_service record created for ubuntu-trusty-osic-cloud1-s3500-4823960:u...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/384428
Reason: This is a bad idea; the service does die correctly, and it was the init script that caused it to come back to life.

Revision history for this message
John Garbutt (johngarbutt) wrote :

Turns out, the init script is restarting the process after 150 seconds of failure.

This races with the tempest retries, and sometimes means the tests pass (tempest gets retried 3 times, I think).

So nova-compute dies when nova-conductor is not running, which is OK. It might be better if we retried a few times once we are up, but it seems like that would need a bit of rework.
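
To illustrate the "retry a few times once we are up" idea, a hedged sketch (the names here are invented, this is not nova code): instead of resolving the version cap in __init__, where a conductor outage is fatal, resolve it lazily and keep retrying on later calls.

    import oslo_messaging

    class LazyVersionCap(object):
        """Illustrative only: defer the version-cap lookup until it succeeds."""

        def __init__(self, lookup):
            self._lookup = lookup  # e.g. the minimum-service-version query
            self._cap = None

        def get(self):
            # Return None (no cap yet) until conductor answers; callers can
            # retry on their next RPC instead of crashing during startup.
            if self._cap is None:
                try:
                    self._cap = self._lookup()
                except oslo_messaging.MessagingTimeout:
                    return None
            return self._cap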

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → New
assignee: John Garbutt (johngarbutt) → nobody
Revision history for this message
Sean Dague (sdague) wrote :

The bug report seems to indicate this isn't really an issue any more?

Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired