Deploying sometimes results in "Services not running that should be: nova-compute"

Bug #1861094 reported by Liam Young
This bug affects 5 people
Affects                                Status   Importance  Assigned to  Milestone
OpenStack Nova Cloud Controller Charm  Triaged  High        Unassigned
OpenStack Nova Compute Charm           Triaged  High        Unassigned

Bug Description

When deploying trusty-mitaka the nova-compute units are sometimes stuck in a blocked state with the message "Services not running that should be: nova-compute".

This is caused by the nova-cloud-controller not being attached to the rabbitmq-server when the last nova-compute hook fires. The nova-compute service shuts down if it does not get a reply to a message within a minute. If the nova-compute units have completed all their hooks then there is no subsequent restart of the nova-compute service and it stays down.

If systemd is managing the service then the issue is masked by systemd restarting it, although the restarted service can also time out and trigger the same issue.
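
To make the failure shape concrete, here is a minimal, purely illustrative Python sketch (hypothetical names, not the actual nova code): a single startup RPC with a roughly one-minute timeout and no retry means one missed reply terminates the process, and only an external supervisor such as systemd will bring it back.

    import sys
    import time

    RPC_REPLY_TIMEOUT = 60  # nova-compute waits about a minute for a reply


    class MessagingTimeout(Exception):
        """Stand-in for oslo_messaging.exceptions.MessagingTimeout."""


    def call_conductor(timeout=RPC_REPLY_TIMEOUT):
        """Hypothetical stand-in for the startup RPC to nova-conductor.

        With no nova-cloud-controller <-> rabbitmq-server relation there
        is nothing to answer, so the call simply waits out the timeout.
        """
        time.sleep(timeout)
        raise MessagingTimeout('Timed out waiting for a reply')


    def start_compute():
        try:
            call_conductor()
        except MessagingTimeout:
            # No retry at this point in startup: the process exits. Under
            # upstart nothing respawns it; under systemd the unit's restart
            # policy masks the problem by bringing it straight back up.
            sys.exit(1)


    if __name__ == '__main__':
        start_compute()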

Tags: cdo-qa
Liam Young (gnuoy)
Changed in charm-nova-compute:
assignee: nobody → Liam Young (gnuoy)
Liam Young (gnuoy)
summary: - Deploying trusty mitaka sometime results in "Services not running that
- should be: nova-compute"
+ Deploying sometime results in "Services not running that should be:
+ nova-compute"
description: updated
Revision history for this message
Liam Young (gnuoy) wrote :

This bug is easy to reproduce. Deploy a bundle and omit the nova-cloud-controller <-> rabbitmq-server relation.

Once the deployment has settled the nova-compute service will either be down (upstart) or continually restarting (systemd). Add the nova-cloud-controller <-> rabbitmq-server relation and nova-compute will stay down (upstart). On a systemd machine a subsequent restart will probably bring nova-compute back if it hasn't timed out.
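
A rough script of that reproducer, driving the juju 2.x CLI from Python (application names as in a typical bundle; newer juju clients would use "juju integrate" and "juju exec" instead of "add-relation" and "run"):

    import subprocess


    def juju(*args):
        """Run a juju CLI command and return its stdout."""
        return subprocess.run(('juju',) + args, check=True,
                              capture_output=True, text=True).stdout


    # 1. Deploy a bundle that omits the nova-cloud-controller <->
    #    rabbitmq-server relation and let the model settle (not shown).

    # 2. nova-compute should now be down (upstart) or flapping (systemd).
    print(juju('status', 'nova-compute'))

    # 3. Add the missing relation. Under upstart nova-compute stays down;
    #    under systemd a later automatic restart may bring it back.
    juju('add-relation', 'nova-cloud-controller', 'rabbitmq-server')

    # 4. If it stays down, restart the service on every nova-compute unit.
    juju('run', '--application', 'nova-compute',
         'sudo service nova-compute restart')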

Liam Young (gnuoy)
Changed in charm-nova-compute:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I wonder if the solution is to hold off (in nova-cloud-controller) on setting anything on the nova-compute relations until the rabbitmq-server relation is established; at that point, kick the nova-compute units (by setting relation data) and it should work?
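
A sketch of what that gating could look like in the nova-cloud-controller hooks, using charmhelpers hookenv primitives (this is not the charm's actual code; the relation names 'amqp' and 'cloud-compute' and the completeness check on the 'password' key are assumptions):

    from charmhelpers.core.hookenv import (
        is_relation_made,
        log,
        relation_ids,
        relation_set,
    )


    def notify_computes_if_ready(compute_settings):
        """Only publish data on the nova-compute relations once amqp is up.

        compute_settings is whatever data the charm would normally hand
        to the compute units over the cloud-compute relation.
        """
        if not is_relation_made('amqp', keys='password'):
            log('amqp relation not complete; deferring cloud-compute data')
            return
        for rid in relation_ids('cloud-compute'):
            relation_set(relation_id=rid, relation_settings=compute_settings)

Re-running the same check from the amqp relation hooks would then kick the compute units via a relation-data change once rabbitmq is actually wired up, which is roughly the "kick by set relation" suggested above.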

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

I think this hits us as well with systemd. Here on xenial-ocata for example: http://osci:8080/view/MojoMatrix/job/mojo_runner/22975/console

nova-compute/1 blocked idle 48 172.17.107.74 Services not running that should be: nova-compute

2020-06-29 11:01:56.798 21721 ERROR oslo_service.service [req-7ca5cf2c-8eb8-4909-afe9-3478aa7043be - - - - -] Error starting thread.
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service Traceback (most recent call last):
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_service/service.py", line 722, in run_service
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service service.start()
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 148, in start
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service ctxt, self.host, self.binary)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service args, kwargs)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 239, in object_class_action_versions
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service args=args, kwargs=kwargs)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service retry=self.retry)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 97, in _send
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service timeout=timeout, retry=retry)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 458, in send
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service retry=retry)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in _send
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service result = self._waiter.wait(msg_id, timeout)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in wait
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service message = self.waiters.get(msg_id, timeout=timeout)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 238, in get
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service 'to message ID %s' % msg_id)
2020-06-29 11:01:56.798 21721 ERROR oslo_service.service MessagingTimeout: Timed out waiting for a reply to message ID eb9efd8073734a0fa9bc9e015532b486

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Happened here right after a deployment on focal: https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-mysql-router/748714/4/6723/index.html

2020-08-29 10:08:56.024 95537 ERROR oslo_service.service [req-9ecebbf6-d8b1-4304-9258-f6e9e0c9b6a2 - - - - -] Error starting thread.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 1491906c55764d4fbd478c75640fa0aa
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service Traceback (most recent call last):
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 405, in get
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service return self._queues[msg_id].get(block=True, timeout=timeout)
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 322, in get
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service return waiter.wait()
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/eventlet/queue.py", line 141, in wait
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service return get_hub().switch()
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 298, in switch
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service return self.greenlet.switch()
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service _queue.Empty
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service During handling of the above exception, another exception occurred:
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service Traceback (most recent call last):
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 810, in run_service
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service service.start()
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/service.py", line 172, in start
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service self.manager.init_host()
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1399, in init_host
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service instances = objects.InstanceList.get_by_host(
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 175, in wrapper
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service result = cls.indirection_api.object_class_action_versions(
2020-08-29 10:08:56.024 95537 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2020-08-29 10:08:56.024 95537 ERROR oslo_service.serv...


Frode Nordahl (fnordahl)
tags: added: unstable-test
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Frode, I don't think this is an unstable-test. I think it's an error in the nova-cloud-controller charm, in that it shouldn't allow the nova-compute's relations to complete until it has the rabbitmq-server relation sorted out. i.e. I think it's a charm bug in nova-cloud-controller.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Many of our charm gates rely on deploying workloads on top of a cloud that first has to be deployed, and those same charm gates are unstable due to this bug. From the perspective of anyone attempting to land a patch on a charm with the nova-compute/nova-cloud-controller charms in their gate bundles, this bug manifests itself as an unstable test.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I respectfully disagree. There's nothing that can be done to the tests to fix the problem, other than delaying deploying the nova-compute units until the nova-cloud-controller/mysql charms are stable and installed, and this isn't practical as it wouldn't match deployment in the field.

The problem is in the charms themselves. My understanding is that the tag "unstable-test" is reserved for issues caused by the tests themselves, not for issues in the charms.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

In that case I think we might need another bug tag, because what I'm really interested in is individual charm bugs that make multiple charm gates unstable, which this particular bug is a clear example of. Let's discuss in standup.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Based on our conversation surrounding this bug I'm removing the tag and increasing the priority so that we get around to fixing the underlying issue without losing track of it as "just another test code issue", since in reality it is a real charm bug, as you have pointed out.

tags: removed: unstable-test
Changed in charm-nova-compute:
importance: High → Critical
milestone: none → 20.10
Liam Young (gnuoy)
Changed in charm-nova-compute:
assignee: Liam Young (gnuoy) → nobody
Revision history for this message
Liam Young (gnuoy) wrote :

I think this is one of those bugs where distinct issues can result in the same symptom. I recently saw this in the octavia gate test (bionic-train-ha-ovn). The oslo_messaging.exceptions.MessagingTimeout exception does occur if the nova conductor is down, but that does not cause the nova-compute daemon to die, nor does it cause the service to be down according to systemd. As soon as the conductor is brought back up the nova-compute daemon recovers as expected.

The reproducer mentioned in comment #1 does not work for xenial-ocata (comment #3); as mentioned in the bug description, this is probably down to systemd restarting nova-compute.
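
For triage on an affected unit, a small sketch along these lines (assumed log path and systemd unit name) helps distinguish the two cases: timeouts in the log with the unit still active point at a transient conductor/rabbitmq outage, while an inactive unit plus timeouts matches the "stays down" failure this bug is about.

    import subprocess


    def unit_state(service='nova-compute'):
        """Return systemd's view of the service ('active', 'failed', ...)."""
        res = subprocess.run(['systemctl', 'is-active', service],
                             capture_output=True, text=True)
        return res.stdout.strip()


    def messaging_timeouts(log_path='/var/log/nova/nova-compute.log'):
        """Count MessagingTimeout occurrences in the compute log."""
        with open(log_path) as f:
            return sum('MessagingTimeout' in line for line in f)


    if __name__ == '__main__':
        state = unit_state()
        timeouts = messaging_timeouts()
        if state != 'active' and timeouts:
            print('likely this bug: compute died waiting on the conductor')
        elif timeouts:
            print('transient conductor/rabbitmq outage; compute still running')
        else:
            print('no messaging timeouts seen')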

David Ames (thedac)
Changed in charm-nova-compute:
milestone: 20.10 → 21.01
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
David Ames (thedac)
Changed in charm-nova-compute:
milestone: 21.01 → none
Revision history for this message
Frode Nordahl (fnordahl) wrote :
tags: added: cdo-qa
Changed in charm-nova-compute:
status: Confirmed → Triaged
Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → High
Changed in charm-nova-compute:
importance: Critical → High
Revision history for this message
Jeffrey Chang (modern911) wrote :

SQA saw this in some recent Yoga runs; posting some links and the errors below.
https://solutions.qa.canonical.com/testruns/94e7f7c0-6ac7-490b-a349-6048851aed10
https://solutions.qa.canonical.com/testruns/7c5c3299-895a-4da3-93ab-fc0945eee4c0
https://solutions.qa.canonical.com/testruns/c09547ef-0fb0-4a7e-bb85-62fe4c8e085d

We have some config issues with cpu-shared-set and cpu-dedicated-set; not sure if they are related. We will fix those and see.
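
If it helps rule those options out, their current values can be read back with the juju CLI; a tiny sketch (assuming the deployed application is named nova-compute):

    import subprocess

    for opt in ('cpu-shared-set', 'cpu-dedicated-set'):
        out = subprocess.run(
            ['juju', 'config', 'nova-compute', opt],
            capture_output=True, text=True).stdout.strip()
        print(f'{opt}: {out or "<unset>"}')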

Errors from nova-compute.log

2023-07-13 05:11:19.100 34790 ERROR oslo.messaging._drivers.impl_rabbit [req-a12e35e8-e3a8-4b1d-bfa0-1709fd2f938c - - - - -] Connection failed: [Errno 111] ECONNREFUSED (retrying in 9.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED

2023-07-13 06:13:28.436 1095570 WARNING nova.conductor.api [req-ccd0cb45-a009-45ba-ad2e-7bd8f9dd1e2a - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 578237a0fd7b466babfdf5fb382f49d9

2023-07-13 06:14:08.622 1133261 CRITICAL nova [req-31315d1a-f5d4-4fa9-80d1-96fbe385941f - - - - -] Unhandled error: keystoneauth1.exceptions.catalog.EndpointNotFound: ['internal', 'public'] endpoint for placement service in RegionOne region not found
2023-07-13 06:14:08.622 1133261 ERROR nova Traceback (most recent call last):
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/bin/nova-compute", line 10, in <module>
2023-07-13 06:14:08.622 1133261 ERROR nova sys.exit(main())
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/cmd/compute.py", line 59, in main
2023-07-13 06:14:08.622 1133261 ERROR nova server = service.Service.create(binary='nova-compute',
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/service.py", line 252, in create
2023-07-13 06:14:08.622 1133261 ERROR nova service_obj = cls(host, binary, topic, manager,
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/service.py", line 116, in __init__
2023-07-13 06:14:08.622 1133261 ERROR nova self.manager = manager_class(host=self.host, *args, **kwargs)
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 631, in __init__
2023-07-13 06:14:08.622 1133261 ERROR nova self.reportclient = report.SchedulerReportClient()
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 234, in __init__
2023-07-13 06:14:08.622 1133261 ERROR nova self._client = self._create_client()
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 277, in _create_client
2023-07-13 06:14:08.622 1133261 ERROR nova client = self._adapter or utils.get_sdk_adapter('placement')
2023-07-13 06:14:08.622 1133261 ERROR nova File "/usr/lib/python3/dist-packages/nova/utils.py", line 985, in get_sdk_adapter
2023-07-13 06:14:08.622 1133261 ERROR nova return getattr(conn, service_type)
2023-07-13 06:14:...

