All baremetal instances going to ERROR state

Bug #1347795 reported by Derek Higgins
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
Ironic                    Fix Released  Critical    Unassigned
OpenStack Compute (nova)  Fix Released  Critical    Paul Murray
tripleo                   Fix Released  Critical    Derek Higgins

Bug Description

As of approximately 1300 UTC, all tripleo CI is failing to bring up instances.

It looks like the commit that caused it is:
https://review.openstack.org/#/c/71557/

Just waiting for some CI to finish to confirm.

Tags: ci
Dan Prince (dan-prince)
Changed in nova:
importance: Undecided → Critical
status: New → In Progress
aeva black (tenbrae)
Changed in ironic:
status: New → In Progress
importance: Undecided → Critical
aeva black (tenbrae) wrote :

As far as I got triaging this, it seems to stem from this line:

https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L384

    def update_available_resources(self, context):
        ...
        LOG.audit(_("Auditing locally available compute resources"))
        resources = self.driver.get_available_resource(self.nodename)

    def _write_ext_resources(self, resources):
        resources['stats'] = {}
        resources['stats'].update(self.stats)
        self.ext_resources_handler.write_resources(resources)

When a virt driver returns resources containing a "stats" dict, prior to this patch, those stats were exposed to the scheduler and used to inform the ComputeCapabilitiesFilter. After the referenced patch landed, they are ignored, and any stats are overwritten with an empty dict.

This passes the gate because libvirt does not use this mechanism.
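
To make the overwrite described above concrete, here is a minimal standalone sketch. It is not the Nova code itself: `write_ext_resources` is a simplified stand-in for the method quoted above, and the resource and stats keys are invented for illustration.

```python
# Simplified stand-in for the method quoted above; not the real Nova code.
def write_ext_resources(resources, tracker_stats):
    """Mimic the regression: reset 'stats' unconditionally, then update it."""
    resources['stats'] = {}          # discards whatever the driver reported
    resources['stats'].update(tracker_stats)
    return resources


# Hypothetical driver-reported resources; the keys are illustrative only.
driver_resources = {'vcpus': 8, 'stats': {'cpu_arch': 'x86_64'}}
result = write_ext_resources(driver_resources, {'num_instances': 0})

# The driver-supplied 'cpu_arch' entry is no longer present.
print(result['stats'])
```

Under these assumptions, anything a scheduler filter expected to find in the driver's stats is gone by the time the resources are written out.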

Changed in ironic:
status: In Progress → Triaged
Michael Still (mikal) wrote :

I am told that there's a patch out there to fix this on the nova side, but it's not linked here. Can I have a hint please?

Michael Still (mikal) wrote :
Paul Murray (pmurray) wrote :

Devananda is absolutely right - the _write_ext_resources() method assumes that stats is not defined and so over-writes it. The fix for this should be to change:

    def _write_ext_resources(self, resources):
        resources['stats'] = {}
        ...

to something like:

    def _write_ext_resources(self, resources):
        if 'stats' not in resources:
            resources['stats'] = {}
        ...

But I need to verify this is sufficient, as the driver seems to pass the stats dict back already serialized as a JSON blob.
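
The concern about a serialized blob can be sketched as follows. This is only an illustration under stated assumptions: `normalize_stats` is a hypothetical helper, not part of Nova, and it assumes the driver may return `stats` either as a dict, as a JSON string, or not at all.

```python
import json


def normalize_stats(resources):
    """Ensure resources['stats'] is a dict (hypothetical helper, not Nova code).

    A plain "'stats' not in resources" guard would leave a JSON string in
    place, and a later .update() call on that string would fail.
    """
    stats = resources.get('stats')
    if stats is None:
        resources['stats'] = {}
    elif isinstance(stats, str):
        resources['stats'] = json.loads(stats)
    return resources


# Illustrative input: the driver hands back stats already serialized.
serialized = {'stats': json.dumps({'cpu_arch': 'x86_64'})}
normalize_stats(serialized)
serialized['stats'].update({'num_instances': 0})
print(serialized['stats'])
```

Decoding before the update keeps the driver-supplied entries and lets the tracker's entries be added alongside them.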

Paul Murray (pmurray) wrote :

On further investigation it seems that the driver stats field was never handled properly. In the periodic task it is passed through without modification (i.e. the Nova resource tracker stats are not added); in other code paths that update the compute_node, it is overwritten by the Nova resource tracker stats.

I have run up an ironic environment with a number of nodes without the https://review.openstack.org/#/c/71557/ commit and looked at the database. Some reported the driver stats and some the nova stats, but none reported the union of the two.

So the revert may fix the CI, but it will not correct the problem.
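
The two behaviours described in this comment, and the union that neither produced, can be modelled roughly as below. The function names and sample keys are invented for illustration, and letting tracker values win on a key collision is an assumption, not something the report specifies.

```python
def periodic_task_path(driver_stats, tracker_stats):
    """Driver stats pass through unchanged; tracker stats are not added."""
    return dict(driver_stats)


def other_update_path(driver_stats, tracker_stats):
    """Tracker stats replace whatever the driver reported."""
    return dict(tracker_stats)


def union_of_both(driver_stats, tracker_stats):
    """What neither path produced: both sets of stats together.

    Assumption: on a key collision the tracker value takes precedence.
    """
    merged = dict(driver_stats)
    merged.update(tracker_stats)
    return merged


# Illustrative values only.
driver_stats = {'cpu_arch': 'x86_64'}
tracker_stats = {'num_instances': 1}

print(periodic_task_path(driver_stats, tracker_stats))
print(other_update_path(driver_stats, tracker_stats))
print(union_of_both(driver_stats, tracker_stats))
```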

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/109301

Changed in nova:
assignee: nobody → Paul Murray (pmurray)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/109033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ecc07e8f21bddf60fe836f34beab470589918e0
Submitter: Jenkins
Branch: master

commit 8ecc07e8f21bddf60fe836f34beab470589918e0
Author: Derek Higgins <email address hidden>
Date: Wed Jul 23 17:24:18 2014 +0100

    Revert "Add extensible resources to resource tracker"

    This bug added a regression to both nova-bm and ironic,
    neither can deploy instances.
    Fixes-bug: #1347795

    This reverts commit 50b4ba4ee583d25eef10a6608172c002f9bec6f2.

    Change-Id: Icc8d629467911972480b633c7808a0964c9f1c7d

Changed in nova:
status: In Progress → Fix Committed
Paul Murray (pmurray) wrote :
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Paul Murray (<email address hidden>) on branch: master
Review: https://review.openstack.org/109301
Reason: This has been superseded by the revert: https://review.openstack.org/#/c/109033

Marios Andreou (marios-b) wrote :

Hi,

I can confirm that reverting "Add extensible resources to resource tracker" fixed the Ironic boot issues (ironic devstack setup).

Derek Higgins (derekh)
Changed in tripleo:
status: Triaged → Fix Released
assignee: nobody → Derek Higgins (derekh)
aeva black (tenbrae)
Changed in ironic:
status: Triaged → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-3 → 2014.2
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-3 → 2014.2