Comment 3 for bug 1517926

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/247552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=49b0d1741c674714fabf24d8409810064b953202
Submitter: Jenkins
Branch: master

commit 49b0d1741c674714fabf24d8409810064b953202
Author: Roman Podoliaka <email address hidden>
Date: Thu Nov 19 16:00:01 2015 +0200

    servicegroup: stop zombie service due to exception

    If an exception is raised out of the _report_state call, we find that
    the service no longer reports any updates to the database, so the
    service is considered dead, thus creating a kind of zombie service.

    I55417a5b91282c69432bb2ab64441c5cea474d31 seems to introduce a
    regression, which leads to nova-* services marked as 'down', if an
    error happens in a remote nova-conductor while processing a state
    report: only Timeout errors are currently handled, but other errors
    are possible, e.g. a DBError (wrapped with RemoteError on RPC
    client side), if a DB temporarily goes away. This unhandled exception
    will effectively break the state reporting thread - service will be
    up again only after restart.

    While the intention of I55417a5b91282c69432bb2ab64441c5cea474d31 was
    to avoid catching all possible exceptions, it looks like we must do
    that to avoid creating a zombie.
    The other part of that change was to ensure that, during an upgrade, we
    do not spam the log server with MessagingTimeout errors while the
    nova-conductors are being restarted. This change preserves that
    behavior.

    Closes-Bug: #1517926

    Change-Id: I44f118f82fbb811b790222face4c74d79795fe21
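
The error handling described in the commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not nova's actual code: `StateReporter`, `update_fn`, the `model_disconnected` flag, and the log wording are hypothetical, and the local `MessagingTimeout` class is only a stand-in for the messaging library's timeout exception. The point it shows is that a periodic report call should quietly tolerate timeouts, and should also catch every other exception so that the reporting thread keeps running.

```python
import logging

LOG = logging.getLogger(__name__)


class MessagingTimeout(Exception):
    """Hypothetical stand-in for the RPC library's timeout exception."""


class StateReporter:
    """Illustrative reporter whose periodic call never lets an error escape."""

    def __init__(self, update_fn):
        # update_fn is an assumed callable that pushes the service
        # heartbeat to the database (directly or through a conductor).
        self._update_fn = update_fn
        self.model_disconnected = False

    def report_state(self):
        try:
            self._update_fn()
            if self.model_disconnected:
                # A previous report failed; note the recovery once.
                self.model_disconnected = False
                LOG.info("Recovered from being unable to report status.")
        except MessagingTimeout:
            # Expected while conductors restart (e.g. during an upgrade):
            # warn once instead of logging a traceback on every tick.
            if not self.model_disconnected:
                self.model_disconnected = True
                LOG.warning("Timed out while reporting service status.")
        except Exception:
            # Anything else (e.g. a remote DB error) is logged with a
            # traceback but swallowed, so the next tick can try again.
            self.model_disconnected = True
            LOG.exception("Unexpected error while reporting service status.")
```

With this shape, a transient failure only delays the next successful report; the service is not left permanently marked as down until a restart.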