Nova services stop reporting state via remote conductor

Bug #1517926 reported by Roman Podoliaka
This bug affects 3 people
Affects                    Status         Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released   High         Roman Podoliaka
Liberty                    Fix Released   High         Roman Podoliaka

Bug Description

If the _report_state() method (https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L85-L111) of the ServiceGroup DB driver fails remotely in nova-conductor, it effectively breaks the service state reporting thread (https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/db.py#L54-L57): this nova service will be considered 'down' until it is *restarted*.

An example of such a remote failure in nova-conductor is a temporary DB issue, e.g. http://paste.openstack.org/show/479104/

This seems to be a regression introduced in https://github.com/openstack/nova/commit/3bc171202163a3810fdc9bdb3bad600487625443
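
To illustrate the failure mode described above, here is a minimal sketch of a periodic loop that stops for good once its callback raises. It is not nova's code: `run_periodic`, `make_reporter`, and the local `RemoteError` class are hypothetical stand-ins, and ticks are simulated rather than timed.

```python
class RemoteError(Exception):
    """Hypothetical stand-in for a server-side error re-raised on the RPC client."""


def run_periodic(callback, ticks):
    """Call callback once per simulated tick.

    The loop ends on the first unhandled exception and is never rescheduled.
    Returns the list of ticks whose report completed.
    """
    completed = []
    try:
        for tick in range(ticks):
            callback(tick)
            completed.append(tick)
    except Exception:
        # Nothing restarts the loop here, so no further reports are sent.
        pass
    return completed


def make_reporter(failing_ticks):
    """Build a report callback that fails on the given ticks (assumed DB outage)."""
    def report_state(tick):
        if tick in failing_ticks:
            raise RemoteError("DBError: database temporarily unavailable")
    return report_state


# The database is unavailable only at tick 2, yet ticks 3-5 never report.
reports = run_periodic(make_reporter({2}), ticks=6)
print(reports)  # [0, 1]
```

Even though the simulated outage lasts a single tick, the reports after it are never sent, which is why the service stays 'down' until it is restarted.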

Changed in nova:
assignee: nobody → Roman Podoliaka (rpodolyaka)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/247552

Changed in nova:
status: New → In Progress
György Szombathelyi (gyurco) wrote :

I see this problem not only with the remote conductor, but in the conductor itself.

Changed in nova:
importance: Undecided → High
tags: added: liberty-backport-potential
Changed in nova:
assignee: Roman Podoliaka (rpodolyaka) → John Garbutt (johngarbutt)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Roman Podoliaka (rpodolyaka)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/247552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=49b0d1741c674714fabf24d8409810064b953202
Submitter: Jenkins
Branch: master

commit 49b0d1741c674714fabf24d8409810064b953202
Author: Roman Podoliaka <email address hidden>
Date: Thu Nov 19 16:00:01 2015 +0200

    servicegroup: stop zombie service due to exception

    If an exception is raised out of the _report_state call, we find that
    the service no longer reports any updates to the database, so the
    service is considered dead, thus creating a kind of zombie service.

    I55417a5b91282c69432bb2ab64441c5cea474d31 seems to introduce a
    regression, which leads to nova-* services marked as 'down', if an
    error happens in a remote nova-conductor while processing a state
    report: only Timeout errors are currently handled, but other errors
    are possible, e.g. a DBError (wrapped with RemoteError on RPC
    client side), if a DB temporarily goes away. This unhandled exception
    will effectively break the state reporting thread - service will be
    up again only after restart.

    While the intention of I55417a5b91282c69432bb2ab64441c5cea474d31 was
    to avoid catching all possible exceptions, it looks like we must
    do that to avoid creating a zombie.
    The other part of that change was to ensure that during upgrade, we do
    not spam the log server about MessagingTimeouts while the
    nova-conductors are being restarted. This change ensures that still
    happens.

    Closes-Bug: #1517926

    Change-Id: I44f118f82fbb811b790222face4c74d79795fe21
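
The handling described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged patch: the `Reporter` class, `fake_save`, and the local `MessagingTimeout` and `RemoteError` classes are hypothetical stand-ins for the real service object and the messaging library's exceptions.

```python
import logging

LOG = logging.getLogger("servicegroup_sketch")
LOG.addHandler(logging.NullHandler())  # keep the demo quiet


class MessagingTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout while the conductor restarts."""


class RemoteError(Exception):
    """Hypothetical stand-in for a server-side error re-raised on the client."""


class Reporter:
    def __init__(self, save):
        self._save = save  # callable that persists the heartbeat
        self.model_disconnected = False
        self.report_count = 0

    def report_state(self):
        """Send one heartbeat; never let an exception escape to the timer."""
        try:
            self._save()
            self.report_count += 1
            if self.model_disconnected:
                self.model_disconnected = False
                LOG.info("Recovered from being unable to report status.")
        except MessagingTimeout:
            # Log only once per outage to avoid spamming logs during upgrades.
            if not self.model_disconnected:
                self.model_disconnected = True
                LOG.warning("Lost connection while reporting service status.")
        except Exception:
            # Catch-all so the reporting loop survives errors such as a
            # wrapped DBError; the next tick simply tries again.
            LOG.exception("Unexpected error while reporting service status")
            self.model_disconnected = True


# Simulated sequence: ok, remote DB error, ok, two timeouts, ok.
outcomes = iter([None, RemoteError("DBError"), None,
                 MessagingTimeout(), MessagingTimeout(), None])


def fake_save():
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome


reporter = Reporter(fake_save)
for _ in range(6):
    reporter.report_state()
print(reporter.report_count)  # 3
```

The broad `except Exception` is the trade-off the commit message names: it is normally avoided, but here it keeps one transient failure from silencing every later report.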

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/251724

Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b1

This issue was fixed in the openstack/nova 13.0.0.0b1 development milestone.

Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/253224

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/251724
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0647dd4b2ae9f5f6f908102d2ac447440622785
Submitter: Jenkins
Branch: stable/liberty

commit e0647dd4b2ae9f5f6f908102d2ac447440622785
Author: Roman Podoliaka <email address hidden>
Date: Thu Nov 19 16:00:01 2015 +0200

    servicegroup: stop zombie service due to exception

    If an exception is raised out of the _report_state call, we find that
    the service no longer reports any updates to the database, so the
    service is considered dead, thus creating a kind of zombie service.

    I55417a5b91282c69432bb2ab64441c5cea474d31 seems to introduce a
    regression, which leads to nova-* services marked as 'down', if an
    error happens in a remote nova-conductor while processing a state
    report: only Timeout errors are currently handled, but other errors
    are possible, e.g. a DBError (wrapped with RemoteError on RPC
    client side), if a DB temporarily goes away. This unhandled exception
    will effectively break the state reporting thread - service will be
    up again only after restart.

    While the intention of I55417a5b91282c69432bb2ab64441c5cea474d31 was
    to avoid catching all possible exceptions, it looks like we must
    do that to avoid creating a zombie.
    The other part of that change was to ensure that during upgrade, we do
    not spam the log server about MessagingTimeouts while the
    nova-conductors are being restarted. This change ensures that still
    happens.

    Closes-Bug: #1517926

    Change-Id: I44f118f82fbb811b790222face4c74d79795fe21
    (cherry picked from commit 49b0d1741c674714fabf24d8409810064b953202)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/253224
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=821f644e98475d0af53f621ba13930b3dffc6e7b
Submitter: Jenkins
Branch: stable/liberty

commit 821f644e98475d0af53f621ba13930b3dffc6e7b
Author: Roman Podoliaka <email address hidden>
Date: Thu Dec 3 23:29:13 2015 +0200

    reno: document fixes for service state reporting issues

    Related-Bug: #1505471
    Related-Bug: #1517926

    Change-Id: I480cf1b3b5c6a0ecff274c9a4f6be00e6a94756a

tags: added: in-stable-liberty
Changed in nova:
milestone: none → mitaka-1
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 12.0.1

This issue was fixed in the openstack/nova 12.0.1 release.

Matt Riedemann (mriedem)
tags: added: conductor
removed: liberty-backport-potential