Server booted to a server group gets stuck in scheduling state if the policy filter is not configured

Bug #1408326 reported by Balazs Gibizer
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Ildiko Vancsa

Bug Description

To reproduce the problem:

Remove ServerGroupAntiAffinityFilter from scheduler_default_filters in nova.conf

$ nova server-group-create my-anti-affinity anti-affinity
$ nova boot --flavor 42 --image 1b0ef685-2a7a-4c11-8e2d-a549215c1b3a --hint group=<uuid of the group created above> vm23

The following exception is visible in the conductor log:
2015-01-07 15:33:44.988 ERROR nova.scheduler.utils [req-675e5f28-59c4-4406-952c-182202aae805 admin admin] ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 ERROR oslo.messaging.rpc.dispatcher [req-675e5f28-59c4-4406-952c-182202aae805 admin admin] Exception during message handling: No valid host was found. ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 137, in _dispatch_and_reply
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 180, in _dispatch
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 126, in _do_dispatch
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/conductor/manager.py", line 626, in build_instances
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher filter_properties)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/scheduler/utils.py", line 311, in setup_instance_group
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher group_info = _get_group_details(context, instance_uuid, group_hosts)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/scheduler/utils.py", line 291, in _get_group_details
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher raise exception.NoValidHost(reason=msg)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher NoValidHost: No valid host was found. ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher

But the server stays stuck in the scheduling state forever.
$ nova list
+--------------------------------------+------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+--------------------+
| 4b919b5c-1db4-4705-a608-8ce8db862b07 | vm23 | BUILD | scheduling | NOSTATE | |
+--------------------------------------+------+--------+------------+-------------+--------------------+

Tags: scheduler
Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Pasquale Porreca (pasquale-porreca) wrote :

Can you share your nova.conf (at least the part relative to Filter Scheduler)?

Balazs Gibizer (balazs-gibizer) wrote :

The only thing I changed from the default devstack nova.conf is that I removed the ServerGroupAntiAffinityFilter from scheduler_default_filters, to trigger the code path that rejects scheduling a server into a group with the anti-affinity policy when the filter is not available.

So basically I added the following line to the default devstack nova.conf:
scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, RamFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAffinityFilter

I checked the conductor code: the unhandled exception comes from here https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L625 and the solution is straightforward. Basically we have to move the scheduler_utils.setup_instance_group() call into the try/except block below it.

I'm working on a patch that will fix the problem.
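
A minimal sketch of why the placement of the call matters, using simplified stand-ins rather than the actual nova code. Only the names NoValidHost, setup_instance_group and ServerGroupAntiAffinityFilter come from this report; FakeInstance, select_destinations, the filter_properties key "group_policy" and both build_instances_* functions are hypothetical.

```python
# Simplified, hypothetical stand-ins. This is not the nova implementation.


class NoValidHost(Exception):
    """Stand-in for nova's exception.NoValidHost."""


class FakeInstance:
    """Hypothetical instance record holding only the two states of interest."""

    def __init__(self, name):
        self.name = name
        self.vm_state = "building"
        self.task_state = "scheduling"


def setup_instance_group(filter_properties, enabled_filters):
    """Stand-in for scheduler_utils.setup_instance_group()."""
    if (filter_properties.get("group_policy") == "anti-affinity"
            and "ServerGroupAntiAffinityFilter" not in enabled_filters):
        raise NoValidHost("ServerGroupAntiAffinityFilter not configured")


def select_destinations(filter_properties):
    """Stand-in for the scheduler call; this sketch always returns one host."""
    return ["host1"]


def build_instances_before(instance, filter_properties, enabled_filters):
    # The group setup runs outside the try block, so its exception escapes
    # and nothing ever updates the instance state.
    setup_instance_group(filter_properties, enabled_filters)
    try:
        return select_destinations(filter_properties)
    except NoValidHost:
        instance.vm_state = "error"
        instance.task_state = None
        return []


def build_instances_after(instance, filter_properties, enabled_filters):
    # The group setup runs inside the try block, so the existing handler
    # also covers it and the instance is moved to an error state.
    try:
        setup_instance_group(filter_properties, enabled_filters)
        return select_destinations(filter_properties)
    except NoValidHost:
        instance.vm_state = "error"
        instance.task_state = None
        return []


filters = ["RetryFilter", "ServerGroupAffinityFilter"]
props = {"group_policy": "anti-affinity"}

vm_a = FakeInstance("vm23")
try:
    build_instances_before(vm_a, props, filters)
except NoValidHost:
    pass  # in the real service this surfaces only in the conductor log
print(vm_a.vm_state, vm_a.task_state)

vm_b = FakeInstance("vm23")
build_instances_after(vm_b, props, filters)
print(vm_b.vm_state, vm_b.task_state)
```

In the first variant the instance keeps its "scheduling" task state because the handler never runs; in the second the same handler that already deals with scheduler failures also catches the group setup failure.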

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145761

Changed in nova:
status: New → In Progress
Claudiu Belu (cbelu)
tags: added: scheduler
Claudiu Belu (cbelu) wrote :

Is this bug also present in Juno? If so, can you add the appropriate juno-backport-potential tag? Thanks.

Balazs Gibizer (balazs-gibizer) wrote :

I will test it on Juno to see if it needs to be backported.

Balazs Gibizer (balazs-gibizer) wrote :

This bug is not present in Juno (or Icehouse), as the problem was introduced by this change https://review.openstack.org/#/c/128058/ and that change merged only in Kilo-1. Therefore no backporting effort is needed.

Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)
Jay Pipes (jaypipes) wrote :

IMO, the bug is that setup_instance_group() is raising NoValidHost when the anti-affinity filter is not enabled. The exception raised should be more descriptive of the problem... for example something like FilterConfigurationError or UnsupportedPolicyException.
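
A rough sketch of the idea in the comment above: a dedicated exception type lets the caller tell a configuration problem apart from a genuine lack of hosts. The class below is a hypothetical stand-in that only borrows the name UnsupportedPolicyException; POLICY_FILTERS, check_policy_supported and describe_failure are illustrative names, and the mapping from policy to filter is an assumption based on the filter names quoted in this report.

```python
# Hypothetical illustration. This is not the nova implementation.


class UnsupportedPolicyException(Exception):
    """Stand-in: the requested group policy cannot be enforced as configured."""


# Assumed mapping from server group policy to the filter that enforces it.
POLICY_FILTERS = {
    "affinity": "ServerGroupAffinityFilter",
    "anti-affinity": "ServerGroupAntiAffinityFilter",
}


def check_policy_supported(policy, enabled_filters):
    """Raise a descriptive error if the policy's filter is not enabled."""
    required = POLICY_FILTERS.get(policy)
    if required is not None and required not in enabled_filters:
        raise UnsupportedPolicyException("%s not configured" % required)


def describe_failure(policy, enabled_filters):
    """Show how a caller can report the configuration problem explicitly."""
    try:
        check_policy_supported(policy, enabled_filters)
    except UnsupportedPolicyException as exc:
        return "configuration problem: %s" % exc
    return "policy supported"


enabled = ["RetryFilter", "ComputeFilter", "ServerGroupAffinityFilter"]
print(describe_failure("anti-affinity", enabled))
print(describe_failure("affinity", enabled))
```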

Changed in nova:
importance: Undecided → Low
Jay Pipes (jaypipes) wrote :

Setting to Low because this is partly a configuration issue, and not something that is common to run into.

Changed in nova:
assignee: Ildiko Vancsa (ildiko-vancsa) → Balazs Gibizer (balazs-gibizer)
Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)
Changed in nova:
assignee: Ildiko Vancsa (ildiko-vancsa) → Balazs Gibizer (balazs-gibizer)
Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/145761
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=28aedc604d993be4b1b38822fd08500df94b6ce3
Submitter: Jenkins
Branch: master

commit 28aedc604d993be4b1b38822fd08500df94b6ce3
Author: Balazs Gibizer <email address hidden>
Date: Wed Jan 7 16:01:38 2015 +0100

    Fix leaking exceptions from scheduler utils

    scheduler_utils.setup_instance_group() can raise NoValidHost
    exception and the conductor does not handle that properly which
    causes the server to get stuck in a scheduling state instead of
    going to ERROR.

    The function is called several times in conductor/manager.py. This
    patch moves the function calls in the try blocks already present.
    It handles the missing Affinity filters as UnsupportedPolicyException,
    which is a new exception type added here. This way the exception
    raised by scheduler_utils.setup_instance_group() will cause the
    VM to end up in ERROR state instead of NOSTATE, as it is expected.
    Quota rollback is also handled properly, when it is needed.

    Closes-bug: #1408326
    Change-Id: I3a8ba37ded8ac3fa9b5f3f35b09feb82d822d583

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-2 → 2015.1.0