ironic: n-cpu fails to recover after losing connection to ironic-api and placement-api

Bug #1750450 reported by Jim Rollenhagen
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Jim Rollenhagen

Bug Description

The ironic virt driver does some crazy things when the ironic API goes down - it returns [] from get_available_nodes(). When the resource tracker sees this, it immediately attempts to delete all of the compute node records and resource providers for said nodes.

If placement is also down at this time, the resource providers will not be properly deleted.

When ironic-api and placement-api return, nova will see nodes, create compute_node records for them, and try to create new resource providers (as they are new compute_node records). This will fail with a name conflict, and the nodes will be unusable.
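
The sequence above can be reproduced with a toy model. This is only a sketch of the failure mode, not nova code: FakePlacement, FakeResourceTracker, PlacementDown, NameConflict and demo are all hypothetical names, and the only rule modelled is that resource provider names must be unique.

```python
class PlacementDown(Exception):
    """Hypothetical error: the placement API cannot be reached."""


class NameConflict(Exception):
    """Hypothetical error: a resource provider with this name exists."""


class FakePlacement:
    """Toy stand-in for placement; provider names must be unique."""

    def __init__(self):
        self.up = True
        self.providers = {}  # provider name -> compute node record id

    def create(self, name, record_id):
        if not self.up:
            raise PlacementDown()
        if name in self.providers:
            raise NameConflict(name)
        self.providers[name] = record_id

    def delete(self, name):
        if not self.up:
            raise PlacementDown()
        self.providers.pop(name, None)


class FakeResourceTracker:
    """Toy stand-in for the resource tracker's reconciliation pass."""

    def __init__(self, placement):
        self.placement = placement
        self.compute_nodes = {}  # node name -> compute node record id
        self._next_id = 1

    def sync(self, available_nodes):
        # Nodes the driver no longer reports are treated as deleted.
        for name in list(self.compute_nodes):
            if name not in available_nodes:
                del self.compute_nodes[name]
                try:
                    self.placement.delete(name)
                except PlacementDown:
                    pass  # the provider is left behind in placement
        # Nodes without a record get a new record and a new provider.
        unusable = []
        for name in available_nodes:
            if name not in self.compute_nodes:
                self.compute_nodes[name] = self._next_id
                try:
                    self.placement.create(name, self._next_id)
                except NameConflict:
                    unusable.append(name)
                self._next_id += 1
        return unusable


def demo():
    placement = FakePlacement()
    tracker = FakeResourceTracker(placement)
    tracker.sync(["node-1", "node-2"])  # normal operation
    placement.up = False
    tracker.sync([])                    # ironic down: driver reports no nodes
    placement.up = True
    unusable = tracker.sync(["node-1", "node-2"])  # both services return
    return unusable, placement.providers
```

In this model, the stale providers still point at the old record ids, so every node that comes back collides on its name.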

This is easy to fix, by raising an exception in get_available_nodes, instead of lying to the resource tracker and returning []. However, this causes nova-compute to fail to start if ironic-api is not available.

This may be fine, but it should have a larger discussion. We've added these hacks over the years for some reason; we should look at the bigger picture and decide how we want to handle these cases.

Jim Rollenhagen (jim-rollenhagen) wrote :

There is a patch in progress for this: https://review.openstack.org/#/c/545479/

tags: added: ironic placement
Sylvain Bauza (sylvain-bauza) wrote :

Well, I don't like the proposed solution to provide an exception because it would create a dependency between Nova and Ironic.

That said, we say that the Placement API should be running *before* the computes, so having it down is not really supported. Maybe we should be discussing how Placement could accept recreating an existing RP (resource provider), but yeah, agreed, let's discuss that.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Sylvain Bauza (sylvain-bauza) wrote :

We discussed the possible solutions at the PTG, and it looks like the best way to tackle this bug is to catch the exception and raise a proper NotReadyYet exception that Nova can handle.
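
A minimal sketch of the driver side of that idea, assuming the ironic client call surfaces an outage as a ConnectionError. NotReadyYet here is a local stand-in named after the comment above, and SketchDriver and its list_nodes callable are hypothetical; none of this is the real virt driver code.

```python
class NotReadyYet(Exception):
    """Local stand-in for a 'driver is not ready' exception."""


class SketchDriver:
    """Hypothetical driver wrapping a callable that lists node names."""

    def __init__(self, list_nodes):
        # list_nodes is assumed to raise ConnectionError when the
        # backing API is unreachable.
        self._list_nodes = list_nodes

    def get_available_nodes(self):
        try:
            return list(self._list_nodes())
        except ConnectionError as exc:
            # Do not return []: an empty list reads as "every node was
            # deleted". Signal "unknown" instead.
            raise NotReadyYet("node list is unavailable") from exc
```

The point of the design is that the caller can tell "no nodes exist" apart from "the node list could not be fetched".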

Changed in nova:
status: Confirmed → Triaged
Chris Dent (cdent) wrote :

"NotReadyYet" handling seems sensible.

While we shouldn't expect Placement to be down, we need to be resilient when it is. This will be increasingly necessary as the edge people start doing crazy^wfun stuff.

Changed in nova:
assignee: nobody → Jim Rollenhagen (jim-rollenhagen)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/545479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=acab8b0067b9ac90ed8c27daf04cfb4f926aa41a
Submitter: Zuul
Branch: master

commit acab8b0067b9ac90ed8c27daf04cfb4f926aa41a
Author: Jim Rollenhagen <email address hidden>
Date: Fri Mar 16 16:33:20 2018 +0000

    ironic: stop lying to the RT when ironic is down

    Returning an empty list of nodes can cause all sorts of crazy behavior,
    so we instead bubble up a VirtDriverNotReady exception, which the compute
    manager will ignore.

    Change-Id: Ib0ec1012b74e9a9e74c8879f3feed5f9332b711f
    Related-Bug: #1744139
    Closes-Bug: #1750450
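
What "the compute manager will ignore" means can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged patch: VirtDriverNotReady is redefined locally so the snippet runs without nova installed, and update_available_resource, DownDriver and the known_nodes argument are simplified, hypothetical stand-ins.

```python
import logging

LOG = logging.getLogger(__name__)


class VirtDriverNotReady(Exception):
    """Local stand-in for the exception named in the commit message."""


def update_available_resource(driver, known_nodes):
    """Return the set of nodes to keep after one periodic pass."""
    try:
        reported = set(driver.get_available_nodes())
    except VirtDriverNotReady:
        # The driver cannot say which nodes exist, so nothing is
        # deleted and the existing records are kept as they are.
        LOG.warning("Virt driver is not ready; skipping this pass.")
        return set(known_nodes)
    return reported


class DownDriver:
    """Hypothetical driver whose backing API is unreachable."""

    def get_available_nodes(self):
        raise VirtDriverNotReady()
```

With a driver that is down, a pass over {"node-1", "node-2"} returns the same set, so no compute node records or resource providers are removed while the outage lasts.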

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/575628

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/575628
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=274739f6c73b190c541caa63486390efa1a3d19b
Submitter: Zuul
Branch: stable/queens

commit 274739f6c73b190c541caa63486390efa1a3d19b
Author: Jim Rollenhagen <email address hidden>
Date: Fri Mar 16 16:33:20 2018 +0000

    ironic: stop lying to the RT when ironic is down

    Returning an empty list of nodes can cause all sorts of crazy behavior,
    so we instead bubble up a VirtDriverNotReady exception, which the compute
    manager will ignore.

    Change-Id: Ib0ec1012b74e9a9e74c8879f3feed5f9332b711f
    Related-Bug: #1744139
    Closes-Bug: #1750450
    (cherry picked from commit acab8b0067b9ac90ed8c27daf04cfb4f926aa41a)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.6

This issue was fixed in the openstack/nova 17.0.6 release.
