Ironic computes may not be discovered when node count is less than compute count
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | Medium | Dan Smith | |
| Pike | Fix Released | Medium | Dan Smith | |
| Queens | Fix Released | Medium | Dan Smith | |
| tripleo | Fix Released | Medium | Oliver Walsh | |
Bug Description
In an ironic deployment being built from day zero, there is an ordering problem, which generates a race condition for operators. Consider this common example:
At config time, you create and start three nova-compute services pointing at your ironic deployment. These three provide HA via the ironic driver's hash ring functionality. At config time, there are no ironic nodes present yet, which means running discover_hosts will create no host mappings.
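To make the hash ring behaviour concrete, here is a minimal toy ring in Python. It is illustrative only: nova's ironic driver has its own hash ring implementation, and the `ToyRing` class, its replica count, and the host/node names are all invented for this sketch.

```python
import bisect
import hashlib

def _h(key):
    # Hash a string to an integer position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ToyRing:
    """Toy consistent-hash ring: each compute service gets several
    points on the ring, and an ironic node is owned by the first
    compute point at or after the node's own hash."""

    def __init__(self, hosts, replicas=32):
        points = sorted((_h(f'{host}:{i}'), host)
                        for host in hosts for i in range(replicas))
        self._keys = [k for k, _ in points]
        self._hosts = [h for _, h in points]

    def owner(self, node_uuid):
        i = bisect.bisect(self._keys, _h(node_uuid)) % len(self._keys)
        return self._hosts[i]

ring = ToyRing(['compute1', 'compute2', 'compute3'])
print(ring.owner('node-uuid-1'))  # deterministic, but effectively arbitrary
```

The point is simply that ownership is hash-driven rather than sticky: which compute owns a given node is an accident of hashing, and it can change as ring membership changes.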
Next, a single ironic node is added, which is owned by one of the computes per the hash rules. At this point, you can run discover_hosts and whatever compute owns that node will get a host mapping. Then you add a second ironic node, which causes all three nova-computes to rebalance the hash ring. One or more of the ironic nodes will definitely land on one of the other nova-computes and will suddenly be unreachable because there is no host mapping until the next time discover_hosts is run. Since we track the "mapped" bit on compute nodes, and compute nodes move between hosts with ironic, we won't even notice that the new owner nova-compute needs a host mapping. In fact, we won't notice until we get lucky enough to land a never-mapped ironic node on a nova-compute for the first time and then run discover_hosts after that point.
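A minimal, runnable sketch of this race, assuming a hypothetical data model (the `ComputeNode` class and `host_mappings` set below are illustrative stand-ins, not nova's actual schema). The key behaviour it mirrors is that discover_hosts only examines compute-node records with mapped=0, so a record that changes hands while mapped=1 never triggers a new host mapping:

```python
from dataclasses import dataclass

@dataclass
class ComputeNode:
    hypervisor_hostname: str  # the ironic node
    host: str                 # the nova-compute service that owns it
    mapped: int = 0           # has discover_hosts already seen this record?

host_mappings = set()         # hosts with a HostMapping in the API DB

def discover_hosts(compute_nodes):
    # Mirrors the gist of the real logic: only never-mapped records
    # are examined; mapped=1 records are skipped entirely.
    for cn in compute_nodes:
        if not cn.mapped:
            host_mappings.add(cn.host)  # create HostMapping if missing
            cn.mapped = 1

# Day zero: one ironic node appears, owned by compute1.
nodes = [ComputeNode("ironic-node-1", host="compute1")]
discover_hosts(nodes)
print(host_mappings)          # {'compute1'}

# A second node arrives; the hash ring rebalances and node-1 is now
# reported by compute2, but its record still carries mapped=1.
nodes[0].host = "compute2"
nodes.append(ComputeNode("ironic-node-2", host="compute1"))
discover_hosts(nodes)
print(host_mappings)          # still {'compute1'}: compute2 never gets a
                              # HostMapping, so ironic-node-1 is unreachable
```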
For an automated config management system, this is a lot of complexity to handle in order to converge on a stable, working system. In many cases where you're using ironic to bootstrap another deployment (e.g. tripleo), the number of nodes may be small (fewer than the number of computes) for quite some time.
There are a couple of obvious options I see:
1. Add a --and-services flag to nova-manage, which will also look for all nova-compute services in the cell and make sure those have mappings (see the sketch after this list). This is ideal because we could get all services mapped at config time without even having an ironic node in place yet (which is not possible today). We can't do this efficiently right away because nova.services does not have a mapped flag, and thus the scheduler periodic should _not_ include services.
2. We could unset compute_node.mapped any time we re-home an ironic node to a different nova-compute. This would cause the scheduler periodic task to notice the change and create a host mapping if the node happens to move to an unmapped nova-compute. This generates extra work during normal operation, and it still leaves an interval during which a previously-usable ironic node is unusable until the host discovery periodic task runs again.
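A rough sketch of what option 1 amounts to, with hypothetical names (`Service`, `discover_services`; the real change would live in nova-manage and nova's objects): mapping is keyed off the services table, so it works before any compute_nodes record exists.

```python
from dataclasses import dataclass

@dataclass
class Service:
    binary: str   # e.g. 'nova-compute'
    host: str

def discover_services(cell, services, host_mappings):
    """Ensure a HostMapping exists for every nova-compute service in
    the cell, regardless of whether any compute nodes exist yet."""
    for svc in services:
        if svc.binary == 'nova-compute' and svc.host not in host_mappings:
            host_mappings[svc.host] = cell   # create the HostMapping

# At config time, before the first ironic node is enrolled:
mappings = {}
discover_services('cell1',
                  [Service('nova-compute', f'compute{i}') for i in (1, 2, 3)],
                  mappings)
print(sorted(mappings))   # ['compute1', 'compute2', 'compute3']
```

If something like this were wired into nova-manage behind the proposed --and-services flag, config tooling could run it once at deploy time, before any ironic node is enrolled, and stop caring about node arrival order.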
IMHO, we should do #1. It's a backportable change, and it's actually a better workflow for config automation tools than what we have today, even discounting this race. We can do what we did before: land the simple version for the backports, then add a mapped bit on master to make it efficient enough to include in the scheduler periodic task.
Changed in tripleo:
  assignee: nobody → Oliver Walsh (owalsh)
  milestone: none → rocky-1
  status: New → In Progress
  importance: Undecided → Medium

Changed in tripleo:
  milestone: rocky-1 → rocky-2

Changed in tripleo:
  milestone: rocky-2 → rocky-3

Changed in tripleo:
  milestone: rocky-3 → rocky-rc1

Changed in tripleo:
  milestone: rocky-rc1 → stein-1

Changed in tripleo:
  status: In Progress → Fix Released

Changed in tripleo:
  milestone: stein-1 → rocky-rc1
See this RHBZ for reproducer instructions: https://bugzilla.redhat.com/show_bug.cgi?id=1554460