nova-compute errors on startup when ironic user isn't registered in keystone

Bug #1295503 reported by Robert Collins
This bug affects 2 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Robert Collins

Bug Description

The specific case we saw in testing was when the ironic service user hadn't been created, but in general it makes system startup fragile if we have to formally sequence bring-up across a DAG.

The specific error was in run_service which has a bunch of code that presumes the full hypervisor functionality is available.

TripleO can probably deploy this today using brute force - o-r-c will just retry every 30s until keystone is up and we've had time to configure the ironic user, but it would be much much better if the needed one-time work was just queued until the api was available.
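
A minimal sketch of what "queued until the api was available" could look like. Everything here is hypothetical and for illustration only: `DeferredWork` and `ServiceNotReady` are invented names, not part of nova, ironic, or os-refresh-config.

```python
from collections import deque


class ServiceNotReady(Exception):
    """Hypothetical signal that a dependency (e.g. the API) is not usable yet."""


class DeferredWork:
    """Queue one-time startup tasks and run them once the dependency responds."""

    def __init__(self):
        self._pending = deque()

    def defer(self, task):
        """Register a zero-argument callable to run later."""
        self._pending.append(task)

    def run_pending(self):
        """Run queued tasks in order; stop at the first one that is not ready.

        Returns the number of tasks that completed in this pass.
        """
        done = 0
        while self._pending:
            task = self._pending[0]
            try:
                task()
            except ServiceNotReady:
                break  # leave it queued and try again on the next pass
            self._pending.popleft()
            done += 1
        return done

    @property
    def pending(self):
        """Number of tasks still waiting to run."""
        return len(self._pending)
```

With something along these lines the service could finish starting and call `run_pending()` from a periodic task, rather than failing outright and relying on an external restart loop.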

To reproduce:
 Install and configure ironic as usual but give a bad (e.g. missing user) ironic user in the nova config.

aeva black (tenbrae) wrote :

I believe this is a feature, not a bug :)

Seriously though, if libvirt isn't available and n-cpu is configured to use that driver, then AFAIK, it doesn't start. This is no different. Nova Compute requires the hypervisor driver to be available during init_host(), or else critical initialization tasks (which are only performed during init_host) cannot be performed.

aeva black (tenbrae) wrote :

After looking further in the code and chatting with Chris Behrens about this, I would like to offer the following summary:

- nova.compute.manager.init_host() calls driver.list_instances(). This needs to happen during start of n-cpu.
- nova.virt.ironic.driver includes a _retry_if_service_is_unavailable() method to retry if the ir-api service is not available
- the problem you have is actually that the "ironic" service user in _keystone_ is not yet created at this point, and the error being raised is not HTTPServiceUnavailable, so it's not getting retried.

Robert, your bug report didn't include the actual exception class being raised -- could you attach that? While I think that bringing the n-cpu and ironic services online before their keystone service accounts are created is a bug in tripleo's tooling, I also think it's reasonable for ironic.driver to retry on any transitory failure. I'm fine adding this to the list of exceptions for which it retries.
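
The retry behaviour discussed above can be sketched as a decorator that retries on a configurable set of exceptions. This is an illustration under assumptions, not the actual `_retry_if_service_is_unavailable()` code: `retry_on`, `ServiceUnavailable`, and `Unauthorized` are stand-in names defined here, and whether a 401 should count as transient is exactly the judgment call under discussion.

```python
import functools
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 error from the API client."""


class Unauthorized(Exception):
    """Stand-in for the HTTP 401 raised when the service user is missing."""


def retry_on(transient, attempts=5, delay=0.0, sleep=time.sleep):
    """Retry the wrapped call while it raises one of the `transient` exceptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except transient:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
        return wrapper
    return decorator
```

If only `ServiceUnavailable` is listed, an `Unauthorized` error propagates on the first call, which matches the failure described in this report. Adding `Unauthorized` to the tuple makes the same call retry until the account exists or the attempts run out.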

Changed in ironic:
status: New → Incomplete
importance: Undecided → Medium
aeva black (tenbrae) wrote :

For reference, follow-on bug after looking into the retry code:
  https://bugs.launchpad.net/ironic/+bug/1295870

Robert Collins (lifeless) wrote :

So, this may be a feature, but we can't deploy Ironic with heat as-is.

"Unauthorized: The request you have made requires authentication. (HTTP 401)"

is the error, which is 'raise exceptions.from_response' in keystoneclient/session.py.

Here is the sequence of a deploy.

1) we create N nova instances which have local scripts to handle local initialisation.
2) we wait until *all* the instances have signalled 'ready' via their wait condition. This happens when os-collect-config completes without error.
3) We initialize keystone and all other centrally managed API based services (e.g. nova flavors, aggregates etc etc).

Why? Firstly there isn't a robust way to say 'run this thing on *just one host*' in a heat cluster today, which is why the keystone initialisation is done from *outside*. Secondly, having admin keystone credentials spread out amongst the cluster is undesirable, so we try to minimise where we need privileged access - and keystone front ends definitely are not that.

We might be able to work around this by making ironic run *after* the signal that the node is ready, but that's super ugly...

summary: - ironic nova driver blocks nova-compute startup when ironic isn't
- available
+ nova-compute errors on startup with ironic nova when ironic user isn't
+ registered in keystone
aeva black (tenbrae)
Changed in ironic:
status: Incomplete → Triaged
summary: - nova-compute errors on startup with ironic nova when ironic user isn't
- registered in keystone
+ nova-compute errors on startup when ironic user isn't registered in
+ keystone
aeva black (tenbrae) wrote :

Lengthy discussion in IRC and on etherpad:
  https://etherpad.openstack.org/p/ironic-nova-friction

Summary:
- nova compute process configured to use nova.virt.ironic driver
- during nova.compute.manager:init_host, several driver methods are invoked
- if the "ironic" service account has not been created in keystone (or is misconfigured), an exception is raised and nova-compute fails to start

Additionally, nova.compute.manager:init_host() and _init_instance() are not strictly necessary when the hypervisor driver is nova.virt.ironic. Some actions performed therein are completely unnecessary (eg, those related to instance migration) and some are nice-but-not-necessary (eg, resuming a failed delete). Even the nice-but-not-necessary become confusing (at best) or harmful (at worst) if multiple nova-computes start up and advertise the same list of Ironic instances.

Since there is no mapping of nova compute host :: ironic node, all nova compute hosts should use the same hostname and will see the complete list of nodes and instances. The proposed solution to this is to override _init_instance and init_host.
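
The proposed override can be sketched as follows. This is a simplified illustration, not the merged change: `ComputeManager` below is a small stand-in for `nova.compute.manager.ComputeManager`, and `IronicComputeManagerSketch` and `UnavailableDriver` are hypothetical names.

```python
class ComputeManager:
    """Simplified stand-in for nova.compute.manager.ComputeManager."""

    def __init__(self, driver):
        self.driver = driver
        self.initialized = []

    def init_host(self):
        # The real manager calls into the hypervisor driver at startup.
        for instance in self.driver.list_instances():
            self._init_instance(instance)

    def _init_instance(self, instance):
        # Per-instance recovery work (e.g. resuming a failed delete).
        self.initialized.append(instance)


class IronicComputeManagerSketch(ComputeManager):
    """Hypothetical subclass that makes no driver calls at startup."""

    def init_host(self):
        # No driver or API calls, so startup does not depend on keystone
        # or the Ironic API being reachable.
        return None

    def _init_instance(self, instance):
        # Skipped so that several nova-compute processes sharing one
        # hostname do not race to recover the same instance.
        return None


class UnavailableDriver:
    """Driver stub whose API always rejects the request."""

    def list_instances(self):
        raise RuntimeError("Unauthorized (HTTP 401)")
```

With the stub driver, the stand-in base class fails in `init_host()` while the subclass starts cleanly, which is the behaviour the override is meant to achieve.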

Changed in ironic:
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/82637

Changed in ironic:
assignee: nobody → Robert Collins (lifeless)
status: Triaged → In Progress
aeva black (tenbrae)
Changed in ironic:
milestone: none → icehouse-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/82637
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=a719472d948617939bc9374f8aa5b488ff76c478
Submitter: Jenkins
Branch: master

commit a719472d948617939bc9374f8aa5b488ff76c478
Author: Robert Collins <email address hidden>
Date: Tue Mar 25 11:28:03 2014 +1300

    Provide a new ComputeManager for Ironic

    Ironic is internally clustered and that causes some friction
    with Nova which assumes that each n-cpu instance is the only
    one responsible for any given VM. While this is going to be
    addressed in the long term, at the moment we need to make
    it possible to run in an HA setup for Ironic.

    The short term solution proposed is to run 2+ nova-compute's
    each of which reports the same hostname (e.g. ironic). This
    works but has some caveats - one of which is that _init_instance
    will now race between different nova-compute instances starting
    up at the same time. A custom ComputeManager permits us to address
    that without prejudice to future long term solutions.

    Relatedly, TripleO needs Ironic to permit service startup before
    keystone is initialised, which the removal of API calls during
    init_host permits - and as there are no API calls needed for
    correct behaviour of the Ironic driver, this is straight forward :).

    See https://etherpad.openstack.org/p/ironic-nova-friction for
    more discussion.

    Change-Id: I68d46c4da8715df03c3a88393b55665dc57045a3
    Closes-Bug: #1295503

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-rc1 → 2014.1