ceph-changed() unnecessarily restarts nova-compute service as many times as there are ceph-mon units

Bug #1835045 reported by Seyeong Kim
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Nova Compute Charm
Fix Released
Undecided
Rodrigo Barbieri

Bug Description

The following code is to be found in the config_changed hook of the nova-compute charm:

    for rid in relation_ids('ceph'):
        for unit in related_units(rid):
            ceph_changed(rid=rid, unit=unit)

This causes the ceph_changed() function to run once per rid.

ceph_changed() calls is_broker_action_done(), which checks for the existence of broker-rsp-UNITNAME-0. ceph_changed() then conditionally restarts nova-compute:

    if is_request_complete(get_ceph_request()):
        log('Request complete')
        # Ensure that nova-compute is restarted since only now can we
        # guarantee that ceph resources are ready, but only if not paused.
        if (not is_unit_paused_set() and
                not is_broker_action_done('nova_compute_restart', rid,
                                          unit)):
            service_restart('nova-compute')
            mark_broker_action_done('nova_compute_restart', rid, unit)
    else:
        send_request_if_needed(get_ceph_request())

If the unit we are processing does not have broker-rsp-UNITNAME-0 set, then is_broker_action_done() will always return False:

rdata = relation_get(rid=rid, unit=unit) or {}
    broker_rsp = rdata.get(get_broker_rsp_key())
    if not broker_rsp:
        return False

Therefore the conditional expression in ceph_changed() will evaluate to True, unless the unit is paused, and restart the nova-compute service.

However only the leader has broker-rsp-UNITNAME-0 set; it is set by the handle_broker_request() function in the ceph-mon charm.

The non-leader units don't have broker-rsp-UNITNAME-0 set, as can be seen below:

ceph-mon/0
{u'auth': u'cephx', u'private-address': u'10.0.0.25', u'broker-rsp-nova-compute-0': u'{"request-id": "551b0a73-9bb4-11e9-9544-525400eb408b", "exit-code": 0}', u'broker-rsp-nova-compute-1': u'{"exit-code": 0, "request-id": "b8d82fa1-9c62-11e9-b901-5254005cde21"}', u'key': u'AQCHhBldMRF3JBAAmIHYqPHRbUJiFlrCdYHBCw==', u'broker_rsp': u'{"exit-code": 0, "request-id": "b8d82fa1-9c62-11e9-b901-5254005cde21"}', u'ceph-public-address': u'10.0.0.25'}

ceph-mon/1
{u'key': u'AQCHhBldMRF3JBAAmIHYqPHRbUJiFlrCdYHBCw==', u'private-address': u'10.0.0.30', u'ceph-public-address': u'10.0.0.30', u'auth': u'cephx'}

..

The upshot is that when jujud-unit-nova-compute-x is restarting, it restarts nova-compute service several times as well, once for each ceph-mon unit, which is undesirable.

I would add the code below to remove those unnecessary restarts, and restart only once - when we process the ceph-mon leader:

1) charm-helpers in hooks/charmhelpers/contrib/storage/linux/ceph.py

def has_broker_rsp(rid=None, unit=None):
    rdata = relation_get(rid=rid, unit=unit) or {}
    broker_rsp = rdata.get(get_broker_rsp_key())
    if not broker_rsp:
        return False
    return True

2) charm-nova-compute in hooks/nova_compute_hooks.py

    if has_broker_rsp(rid, unit):

        if is_request_complete(get_ceph_request()):
            log('Request complete')
            # Ensure that nova-compute is restarted since only now can we
            # guarantee that ceph resources are ready, but only if not paused.
            if (not is_unit_paused_set() and
                    not is_broker_action_done('nova_compute_restart', rid,
                                              unit)):
                service_restart('nova-compute')
                mark_broker_action_done('nova_compute_restart', rid, unit)
        else:
            send_request_if_needed(get_ceph_request())

Tags: sts
Seyeong Kim (seyeongkim)
tags: added: sts
Revision history for this message
Seyeong Kim (seyeongkim) wrote :
description: updated
summary: - Only need to run ceph_changed for leader relation
+ ceph-changed() unnecessarily restarts nova-compute service as many times
+ as there are ceph-mon units
Revision history for this message
Takashi Sogabe (sogabe) wrote :

I think it's better to flatten the logic for the case.
Flat version might be like this.

----
        if not has_broker_rsp(rid, unit):
            return
        if is_request_complete(get_ceph_request()):
            log('Request complete')
            # Ensure that nova-compute is restarted since only now can we
            # guarantee that ceph resources are ready, but only if not paused.
            if (not is_unit_paused_set() and
                    not is_broker_action_done('nova_compute_restart', rid,
                                              unit)):
                service_restart('nova-compute')
                mark_broker_action_done('nova_compute_restart', rid, unit)
        else:
            send_request_if_needed(get_ceph_request())
----

Changed in charm-nova-compute:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670594

Changed in charm-nova-compute:
assignee: Rodrigo Barbieri (rodrigo-barbieri2010) → nobody
Changed in charm-nova-compute:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/670594
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=69a6fdeaf40087df6a9f03309d363a33daf64b3e
Submitter: Zuul
Branch: master

commit 69a6fdeaf40087df6a9f03309d363a33daf64b3e
Author: Rodrigo Barbieri <email address hidden>
Date: Thu Jul 11 17:44:09 2019 -0300

    Prevent unnecessary nova-compute restarts on ceph_changed

    Nova-compute service restarts make sense only on relation
    config updates received from the ceph-mon leader.

    Also, synced charm helpers as PR #347 is used as
    part of the fix.

    Change-Id: I406f369b1e376db82b8683fd48fbe4de106da16f
    Closes-bug: #1835045

Changed in charm-nova-compute:
status: In Progress → Fix Committed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

This patch breaks new deployments because the broker request never gets sent. This needs to be fixed before 19.07 goes out.

Changed in charm-nova-compute:
milestone: none → 19.07
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675069

Revision history for this message
Edward Hope-Morley (hopem) wrote :
Revision history for this message
James Page (james-page) wrote :

Re-opening bug as revert in process for 19.07 release.

Changed in charm-nova-compute:
status: Fix Committed → New
milestone: 19.07 → 19.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/675069
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=c0607560ba1b7c98d1aad4351240477973f04202
Submitter: Zuul
Branch: master

commit c0607560ba1b7c98d1aad4351240477973f04202
Author: Edward Hope-Morley <email address hidden>
Date: Wed Aug 7 10:33:27 2019 +0100

    Reverting changes from commit 69a6fde

    This patch reverts the non-charmhelpers changes from
    the patch landed for bug 1835045 that is causing a
    regression whereby new deployments that relate to
    ceph-mon are prevented from sending broker requests
    to do e.g. create the pool needed by
    libvirt-image-backend=rbd.

    Change-Id: I29421ce240a3810d945b76e662a743b4b8497ac8
    Related-Bug: 1835045
    Closes-Bug: 1839297

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676407

Changed in charm-nova-compute:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (master)

Reviewed: https://review.opendev.org/676407
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=b1701e1b3174dcb06c05684319d6c73d6b2f509c
Submitter: Zuul
Branch: master

commit b1701e1b3174dcb06c05684319d6c73d6b2f509c
Author: Rodrigo Barbieri <email address hidden>
Date: Thu Jul 11 17:44:09 2019 -0300

    Prevent unnecessary nova-compute restarts on ceph_changed

    Function 'is_broker_action_done' should return False when it
    finds a response from ceph broker not marked done, in order
    to trigger a nova restart. However, it also returns False if
    there is no response data from ceph broker, triggering an
    unecessary restart.

    The function 'ceph_changed' is invoked under different remote
    unit contexts when there are updates to the relation. When
    querying the broker response, only the context of the remote
    unit that is the broker can see the response, unless
    specifically queried for that given unit.

    The 'ceph_changed' invocations under a remote context that
    are not the broker end up returning False in
    'is_broker_action_done' and causing restarts, even after
    the action is already marked done. This also happens on
    'config-changed' hooks.

    To fix this problem, the logic is now changed have each
    'ceph_changed' invocation loop through units and process
    the broker response, regardless of remote context.

    This is an initial change to address the issue locally
    in nova-compute charm. A later change will be worked on
    to move the new helper methods to charmhelpers,
    refactoring the existing ones there.

    Change-Id: I2b41f8b252f4ccb68830e90c5e68456e15372bcf
    Closes-bug: #1835045

Changed in charm-nova-compute:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-compute (stable/19.07)

Fix proposed to branch: stable/19.07
Review: https://review.opendev.org/677518

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-nova-compute (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/677818

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Added pull request for refactor: https://github.com/juju/charm-helpers/pull/363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-compute (stable/19.07)

Reviewed: https://review.opendev.org/677518
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=0608952687038b208f706329aa6fe22b928b1074
Submitter: Zuul
Branch: stable/19.07

commit 0608952687038b208f706329aa6fe22b928b1074
Author: Rodrigo Barbieri <email address hidden>
Date: Thu Jul 11 17:44:09 2019 -0300

    Prevent unnecessary nova-compute restarts on ceph_changed

    Function 'is_broker_action_done' should return False when it
    finds a response from ceph broker not marked done, in order
    to trigger a nova restart. However, it also returns False if
    there is no response data from ceph broker, triggering an
    unecessary restart.

    The function 'ceph_changed' is invoked under different remote
    unit contexts when there are updates to the relation. When
    querying the broker response, only the context of the remote
    unit that is the broker can see the response, unless
    specifically queried for that given unit.

    The 'ceph_changed' invocations under a remote context that
    are not the broker end up returning False in
    'is_broker_action_done' and causing restarts, even after
    the action is already marked done. This also happens on
    'config-changed' hooks.

    To fix this problem, the logic is now changed have each
    'ceph_changed' invocation loop through units and process
    the broker response, regardless of remote context.

    This is an initial change to address the issue locally
    in nova-compute charm. A later change will be worked on
    to move the new helper methods to charmhelpers,
    refactoring the existing ones there.

    Change-Id: I2b41f8b252f4ccb68830e90c5e68456e15372bcf
    Closes-bug: #1835045
    (cherry picked from commit b1701e1b3174dcb06c05684319d6c73d6b2f509c)

David Ames (thedac)
Changed in charm-nova-compute:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-compute (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-compute/+/677818
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.