Scale: when the periodic pool size is small and the load is high, the compute service goes down

Bug #1776621 reported by Gary Kotton on 2018-06-13
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Gary Kotton

Bug Description

When the nova power-sync pool is exhausted, the compute service goes down. This causes scale and performance tests to fail.

2018-06-12 19:58:48.871 30126 WARNING oslo.messaging._drivers.impl_rabbit [req-196321bb-a11a-4e6e-a80a-544ecd093986 c3de6d9ec02c494d978330d8f1a64da1 d37803befc35418981f1f0b6dceec696 - default default] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer
2018-06-12 19:58:48.872 30126 WARNING oslo.messaging._drivers.impl_rabbit [req-196321bb-a11a-4e6e-a80a-544ecd093986 c3de6d9ec02c494d978330d8f1a64da1 d37803befc35418981f1f0b6dceec696 - default default] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer
2018-06-12 19:58:54.793 30126 WARNING oslo.messaging._drivers.impl_rabbit [req-196321bb-a11a-4e6e-a80a-544ecd093986 c3de6d9ec02c494d978330d8f1a64da1 d37803befc35418981f1f0b6dceec696 - default default] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer
2018-06-12 21:37:23.805 30126 DEBUG oslo_concurrency.lockutils [req-196321bb-a11a-4e6e-a80a-544ecd093986 c3de6d9ec02c494d978330d8f1a64da1 d37803befc35418981f1f0b6dceec696 - default default] Lock "compute_resources" released by "nova.compute.resource_tracker._update_available_resource" :: held 6004.943s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:288
2018-06-12 21:37:23.807 30126 ERROR nova.compute.manager [req-196321bb-a11a-4e6e-a80a-544ecd093986 c3de6d9ec02c494d978330d8f1a64da1 d37803befc35418981f1f0b6dceec696 - default default] Error updating resources for node domain-c7.fd3d2358-cc8d-4773-9fef-7a2713ac05ba.: MessagingTimeout: Timed out waiting for a reply to message ID 1eb4b1b40f0f4c66b0266608073717e8
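The 6004s lock hold in the log above is the crux: the resource tracker serializes `_update_available_resource` behind the `"compute_resources"` lock, so when that update stalls on a slow RPC or DB call, every other task needing the lock stalls with it. A minimal sketch of that blocking pattern (plain `threading`, not nova code; the delay value is illustrative):

```python
# Sketch: a slow update holding a shared lock starves concurrent tasks,
# mirroring the "compute_resources ... held 6004.943s" entry above.
import threading
import time

compute_resources = threading.Lock()

def update_available_resource(delay):
    with compute_resources:   # held for the whole slow update
        time.sleep(delay)     # stands in for the stalled RPC/DB call

t = threading.Thread(target=update_available_resource, args=(0.2,))
t.start()
time.sleep(0.05)              # let the updater grab the lock first

start = time.monotonic()
with compute_resources:       # e.g. another periodic task needing the lock
    waited = time.monotonic() - start
t.join()
print(waited)                 # roughly the full 0.2s update duration
```

In the real failure, the "update" blocked for over an hour and a half on a `MessagingTimeout`, and everything queued behind the lock waited just as long.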

root@controller01:/var/log/nova# vi nova-conductor.log.1
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager [req-77b5e1d7-a4b7-468e-98af-dfdfbf2fad7f 1b5d8da24b39464cb6736d122ccc0665 eb361d7bc9bd40059a2ce2848c985772 - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Traceback (most recent call last):

  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 153, in select_destinations
    allocation_request_version, return_alternates)

  File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 93, in select_destinations
    allocation_request_version, return_alternates)

  File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 245, in _schedule
    claimed_instance_uuids)

  File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 282, in _ensure_sufficient_hosts
    raise exception.NoValidHost(reason=reason)

NoValidHost: No valid host was found. There are not enough hosts available.
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager Traceback (most recent call last):
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 1118, in schedule_and_build_instances
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager instance_uuids, return_alternates=True)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 718, in _schedule_instances
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager return_alternates=return_alternates)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/utils.py", line 727, in wrapped
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager return func(*args, **kwargs)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager instance_uuids, return_objects, return_alternates)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager return getattr(self.instance, __name)(*args, **kwargs)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 42, in select_destinations
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager instance_uuids, return_objects, return_alternates)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager return cctxt.call(ctxt, 'select_destinations', **msg_args)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174, in call
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager retry=self.retry)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131, in _send
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager timeout=timeout, retry=retry)
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
2018-06-12 20:48:10.161 6328 ERROR nova.conductor.manager retry=retry)

Fix proposed to branch: master
Review: https://review.openstack.org/575034

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
status: New → In Progress
Gary Kotton (garyk) wrote:

The scenario is as follows:
1. There is one VC cluster with 8K VMs running.
2. The periodic task puts a huge load on the DB: the instance object is refreshed for each instance, producing a huge burst on the DB.
3. To mitigate that burst we tried to reduce the pool size.

This led to the compute node freezing and the service being reported down.
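The arithmetic behind the freeze can be sketched with back-of-the-envelope numbers (the pool size and per-instance cost below are illustrative assumptions, not measured values; nova marks a service down after `service_down_time`, 60s by default):

```python
# Sketch: a small sync pool plus many instances pushes one periodic
# power-state sync pass far past the 60s service_down_time window.
import math

def sync_duration(num_instances, pool_size, secs_per_instance):
    """Lower bound on wall-clock time for one sync pass when each
    instance refresh is serialized through a bounded pool."""
    batches = math.ceil(num_instances / pool_size)
    return batches * secs_per_instance

# 8K instances (as in the report), a deliberately reduced pool of 10,
# and a modest 0.5s DB round-trip per instance refresh:
print(sync_duration(8000, 10, 0.5))  # 400.0 seconds
```

At 400s per pass, the compute service cannot report in within the 60s window, so it is marked down even though the host is merely busy.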

jichenjc (jichenjc) on 2018-06-14
Changed in nova:
importance: Undecided → Medium
tags: added: compute