Nova services stop reporting state via remote conductor

Bug #1517926 reported by Roman Podoliaka on 2015-11-19
This bug affects 3 people
Affects: OpenStack Compute (nova) | Importance: High | Assigned to: Roman Podoliaka
Affects: Liberty | Importance: High | Assigned to: Roman Podoliaka

Bug Description

If the _report_state() method (https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L85-L111) of the ServiceGroup DB driver fails remotely in nova-conductor, it effectively breaks the service state reporting thread (https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L54-L57): the affected nova service is considered 'down' until it is *restarted*.

An example of such a remote failure in nova-conductor is a temporary DB issue, e.g. http://paste.openstack.org/show/479104/

This seems to be a regression introduced in https://github.com/openstack/nova/commit/3bc171202163a3810fdc9bdb3bad600487625443
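
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not nova code: `FakeRemoteError`, `run_periodic` and `flaky_report` are hypothetical names, and the loop only assumes that the periodic runner gives up at the first exception it does not handle. One transient error is then enough to stop all later reports, even though the underlying problem has gone away.

```python
class FakeRemoteError(Exception):
    """Hypothetical stand-in for an error raised in the remote conductor."""


def run_periodic(report, ticks):
    """Call report(tick) once per tick; stop for good at the first error.

    Returns the number of reports that succeeded.
    """
    succeeded = 0
    try:
        for tick in range(ticks):
            report(tick)
            succeeded += 1
    except Exception:
        # The loop itself is gone: later ticks never run, mimicking a
        # reporting thread that dies on an unhandled exception.
        pass
    return succeeded


def flaky_report(tick):
    # The database is unavailable only during tick 2.
    if tick == 2:
        raise FakeRemoteError("DB temporarily unavailable")


reports_sent = run_periodic(flaky_report, ticks=10)
print(reports_sent)  # 2: ticks 0 and 1 succeed, ticks 3-9 never run
```

Although the simulated outage lasts a single tick, nothing is reported for the remaining seven ticks, which is why the service stays 'down' until it is restarted.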

Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
description: updated

Fix proposed to branch: master
Review: https://review.openstack.org/247552

Changed in nova:
status: New → In Progress
György Szombathelyi (gyurco) wrote :

I see this problem not only with the remote conductor, but in the conductor itself.

Changed in nova:
importance: Undecided → High
tags: added: liberty-backport-potential
Changed in nova:
assignee: Roman Podoliaka (rpodolyaka) → John Garbutt (johngarbutt)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Roman Podoliaka (rpodolyaka)

Reviewed: https://review.openstack.org/247552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=49b0d1741c674714fabf24d8409810064b953202
Submitter: Jenkins
Branch: master

commit 49b0d1741c674714fabf24d8409810064b953202
Author: Roman Podoliaka <email address hidden>
Date: Thu Nov 19 16:00:01 2015 +0200

    servicegroup: stop zombie service due to exception

    If an exception is raised out of the _report_state call, we find that
    the service no longer reports any updates to the database, so the
    service is considered dead, thus creating a kind of zombie service.

    I55417a5b91282c69432bb2ab64441c5cea474d31 seems to introduce a
    regression, which leads to nova-* services marked as 'down', if an
    error happens in a remote nova-conductor while processing a state
    report: only Timeout errors are currently handled, but other errors
    are possible, e.g. a DBError (wrapped with RemoteError on RPC
    client side), if a DB temporarily goes away. This unhandled exception
    will effectively break the state reporting thread - service will be
    up again only after restart.

    The intention of I55417a5b91282c69432bb2ab64441c5cea474d31 was to
    avoid catching all possible exceptions, but it looks like we must
    do that to avoid creating a zombie.
    The other part of that change was to ensure that during upgrade, we do
    not spam the log server about MessagingTimeouts while the
    nova-conductors are being restarted. This change ensures that still
    happens.

    Closes-Bug: #1517926

    Change-Id: I44f118f82fbb811b790222face4c74d79795fe21
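
The approach described in this commit message can be sketched roughly as follows. This is an illustration only, not the committed nova code: `FakeMessagingTimeout`, `StateReporter`, `disconnected` and `timeout_warnings` are hypothetical names. The sketch assumes two goals taken from the message: timeouts are logged once rather than on every tick, and any other exception is logged and swallowed so the periodic loop keeps running.

```python
import logging

LOG = logging.getLogger("state_report_sketch")


class FakeMessagingTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout while the conductor restarts."""


class StateReporter:
    """Hypothetical reporter whose report_state() never raises."""

    def __init__(self, send):
        self._send = send            # callable that performs one report
        self.disconnected = False    # True while reports are failing
        self.timeout_warnings = 0    # how many timeout warnings were logged

    def report_state(self):
        try:
            self._send()
        except FakeMessagingTimeout:
            # Expected during upgrades: warn once, not on every tick.
            if not self.disconnected:
                self.disconnected = True
                self.timeout_warnings += 1
                LOG.warning("Timed out reporting state; will keep retrying")
        except Exception:
            # Anything else (for example a wrapped DB error) is logged and
            # swallowed so that the periodic loop survives.
            self.disconnected = True
            LOG.exception("Unexpected error while reporting state")
        else:
            if self.disconnected:
                self.disconnected = False
                LOG.info("State reporting recovered")


# Simulated outcomes for five consecutive ticks (None means success).
outcomes = iter([None, FakeMessagingTimeout(), FakeMessagingTimeout(),
                 RuntimeError("DB temporarily unavailable"), None])


def send():
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome


reporter = StateReporter(send)
completed_ticks = 0
for _ in range(5):
    reporter.report_state()  # never raises, so the loop runs every tick
    completed_ticks += 1
```

In this run all five ticks complete, the two consecutive timeouts produce a single warning, and the final successful report clears the `disconnected` flag. The broad `except Exception` is a deliberate trade-off: it hides the error type from the caller, but a stopped reporting loop is the worse outcome.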

Changed in nova:
status: In Progress → Fix Committed

This issue was fixed in the openstack/nova 13.0.0.0b1 development milestone.

Changed in nova:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/251724
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0647dd4b2ae9f5f6f908102d2ac447440622785
Submitter: Jenkins
Branch: stable/liberty

commit e0647dd4b2ae9f5f6f908102d2ac447440622785
Author: Roman Podoliaka <email address hidden>
Date: Thu Nov 19 16:00:01 2015 +0200

    servicegroup: stop zombie service due to exception

    If an exception is raised out of the _report_state call, we find that
    the service no longer reports any updates to the database, so the
    service is considered dead, thus creating a kind of zombie service.

    I55417a5b91282c69432bb2ab64441c5cea474d31 seems to introduce a
    regression, which leads to nova-* services marked as 'down', if an
    error happens in a remote nova-conductor while processing a state
    report: only Timeout errors are currently handled, but other errors
    are possible, e.g. a DBError (wrapped with RemoteError on RPC
    client side), if a DB temporarily goes away. This unhandled exception
    will effectively break the state reporting thread - service will be
    up again only after restart.

    The intention of I55417a5b91282c69432bb2ab64441c5cea474d31 was to
    avoid catching all possible exceptions, but it looks like we must
    do that to avoid creating a zombie.
    The other part of that change was to ensure that during upgrade, we do
    not spam the log server about MessagingTimeouts while the
    nova-conductors are being restarted. This change ensures that still
    happens.

    Closes-Bug: #1517926

    Change-Id: I44f118f82fbb811b790222face4c74d79795fe21
    (cherry picked from commit 49b0d1741c674714fabf24d8409810064b953202)

Reviewed: https://review.openstack.org/253224
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=821f644e98475d0af53f621ba13930b3dffc6e7b
Submitter: Jenkins
Branch: stable/liberty

commit 821f644e98475d0af53f621ba13930b3dffc6e7b
Author: Roman Podoliaka <email address hidden>
Date: Thu Dec 3 23:29:13 2015 +0200

    reno: document fixes for service state reporting issues

    Related-Bug: #1505471
    Related-Bug: #1517926

    Change-Id: I480cf1b3b5c6a0ecff274c9a4f6be00e6a94756a

tags: added: in-stable-liberty
Changed in nova:
milestone: none → mitaka-1

This issue was fixed in the openstack/nova 12.0.1 release.

Matt Riedemann (mriedem) on 2016-03-04
tags: added: conductor
removed: liberty-backport-potential