ironic tempest tests periodically failing: No valid host was found

Bug #1398128 reported by Adam Gandelman
This bug affects 2 people
Affects                     Status        Importance  Assigned to  Milestone
Ironic                      Invalid       Undecided   Unassigned
OpenStack Compute (nova)    Invalid       Undecided   Unassigned
devstack                    Fix Released  Undecided   Unassigned

Bug Description

This was noticed on the stable/juno ironic sideways grenade jobs, but is also confirmed to be happening on the check-tempest-dsvm-ironic-parallel-nv job, which runs a similarly configured tempest suite against Ironic:

http://logs.openstack.org/84/137684/1/check/check-grenade-dsvm-ironic-sideways/6d118bc/

A number of the early compute tests fail to spawn an instance, hitting a scheduling error on the client side:

BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
Details: Server eb81ee40-ceba-484d-b665-92ec3bf4fedd failed to build and is in ERROR status
Details: {u'message': u'No valid host was found. ', u'created': u'2014-11-27T17:44:05Z', u'code': 500}

Looking through the nova logs, the request never even makes it to the nova-scheduler. The last error is reported in conductor:

2014-11-27 17:44:01.005 WARNING nova.scheduler.driver [req-a3c046e5-66db-4bca-a6f8-2263763e49a6 SecurityGroupsTestJSON-2119055496 SecurityGroupsTestJSON-1381566740] [instance: 9008811a-f400-42ae-98d5-caf828fa34dc] NoValidHost exception with message: 'No valid host was found.'

Looking at the timestamps of the requests, the first instance is requested at 17:44:00:

2014-11-27 17:44:00.944 24730 DEBUG tempest.common.rest_client [req-a3c046e5-66db-4bca-a6f8-2263763e49a6 None] Request (SecurityGroupsTestJSON:test_server_security_groups): 202 POST http://127.0.0.1:8774/v2/adf4838f0d15462da4601a5d853eafbf/servers 0.515s

However, on the nova-compute side, the resource tracker is not updated to include the enrolled Ironic nodes until much later. The first time the tracker contains any of the ironic resources is at 17:44:06:

2014-11-27 17:44:06.224 21645 AUDIT nova.compute.resource_tracker [-] Total physical ram (MB): 512, total allocated virtual ram (MB): 0

So there's a race between the resource tracker's initial inclusion of available resources and Tempest running the first set of tests that require an instance. This can be worked around in a few ways:

* Adjust the periodic task interval on nova-compute to update much more frequently, though this would only narrow the window, not close it.
* Have tempest run an admin 'nova hypervisor-stats' call on the client side and wait for resources before spawning any instances (baremetal case only).
* Adjust devstack's nova cpu deployment to spin until hypervisor-stats reflect the ironic node parameters (a rough sketch of this follows below).
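
For illustration, the third option might look something like this in devstack's bash. This is a sketch, not the merged change (the actual patches appear later in this thread); the helper name, timeout, and failure message are assumptions.

    # Hypothetical devstack helper: block until 'nova hypervisor-stats'
    # reports at least the expected number of enrolled ironic nodes.
    function wait_for_nova_resources {
        local expected_count=$1
        local i
        for i in $(seq 1 120); do
            # The stats print as a table; the 'count' row's value is the
            # 4th whitespace-separated field.
            if [[ $(nova hypervisor-stats | awk '/ count /{ print $4 }') -ge $expected_count ]]; then
                return 0
            fi
            sleep 1
        done
        die $LINENO "Nova hypervisor-stats did not register at least $expected_count nodes"
    }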

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138158

Changed in devstack:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/138164

Revision history for this message
Alan Pevec (apevec) wrote :

A workaround fix was merged to the Grenade stable/juno branch: https://review.openstack.org/138170

no longer affects: grenade
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/138158
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=c78e4671098f08c2eaf0b033777d3c01082141cd
Submitter: Jenkins
Branch: master

commit c78e4671098f08c2eaf0b033777d3c01082141cd
Author: Adam Gandelman <email address hidden>
Date: Mon Dec 1 11:24:37 2014 -0800

    Wait for hypervisor-stats to reflect ironic nodes

    When enrolling nodes into Ironic, poll nova's hypervisor-stats until
    the hypervisor count reflects the number of enrolled nodes. This
    eliminates a race where devstack completes and an instance is spawned
    before the first post-enrollment periodic task ticks on the Nova side,
    which has recently started popping up in the gate.

    Change-Id: Ib3d8005e0094ee8af2d5fcb65aca6cd92736da90
    Closes-bug: #1398128

Changed in devstack:
status: In Progress → Fix Released
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiTm92YSBoeXBlcnZpc29yLXN0YXRzIGRpZCBub3QgcmVnaXN0ZXIgYXQgbGVhc3QgMyBub2Rlc1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI0MzIwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTc3ODQzODI2NDZ9

As the bugfix does not address the underlying race condition (which is not trivial to fix, due to how the resource tracker provides resources to the scheduler), it creates a side-effect issue.

Maybe we could increase the number of retries? That said, 120 seconds for booting 3 nodes is still a long time, IMHO.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (stable/juno)

Reviewed: https://review.openstack.org/138164
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=ab0a04735f8b54783d6bb2614ac5f2e22e6836bf
Submitter: Jenkins
Branch: stable/juno

commit ab0a04735f8b54783d6bb2614ac5f2e22e6836bf
Author: Adam Gandelman <email address hidden>
Date: Mon Dec 1 11:24:37 2014 -0800

    Wait for hypervisor-stats to reflect ironic nodes

    When enrolling nodes into Ironic, poll nova's hypervisor-stats until
    the hypervisor count reflects the number of enrolled nodes. This
    eliminates a race where devstack completes and an instance is spawned
    before the first post-enrollment periodic task ticks on the Nova side,
    which has recently started popping up in the gate.

    Conflicts:
     lib/ironic

    Change-Id: Ib3d8005e0094ee8af2d5fcb65aca6cd92736da90
    Closes-bug: #1398128
    (cherry picked from commit c78e4671098f08c2eaf0b033777d3c01082141cd)

tags: added: in-stable-juno
Revision history for this message
Dmitry Tantsur (divius) wrote :

Hi Adam! I've seen this issue as well. Can something be done on the Ironic side?

Alan Pevec (apevec)
tags: removed: in-stable-juno
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

So the devstack patch solves the issue in the grenade migration, where we start n-cpu early and don't enroll nodes until much later. But this error is still showing up frequently in the check-tempest-dsvm-ironic-parallel-nv job. Symptoms, via http://logs.openstack.org/87/139687/1/check/check-tempest-dsvm-ironic-parallel-nv/55ce684/logs/:

The test case fails with the following on the tempest/client side:

Details: {u'code': 500, u'created': u'2014-12-05T18:02:26Z', u'message': u'No valid host was found. There are not enough hosts available.'}

During deployment, devstack waits until nova reports 3 nodes in the hypervisor stats, then finishes:

2014-12-05 18:01:27.371 | ++ '[' 3 -ge 3 ']'

On n-cpu's first sync of the 3 nodes, ironic reports each node's properties, but for all updates nova registers 0mb/0gb/0cpu:

http://logs.openstack.org/87/139687/1/check/check-tempest-dsvm-ironic-parallel-nv/55ce684/logs/screen-n-cpu.txt.gz#_2014-12-05_18_01_26_804

2014-12-05 18:01:26.904 30461 INFO nova.compute.resource_tracker [-] Compute_service record created for devstack-trusty-hpcloud-b3-3341906:7048e6e1-e180-4295-8e0a-81eb0416a10a
2014-12-05 18:01:26.999 30461 INFO nova.compute.resource_tracker [-] Compute_service record created for devstack-trusty-hpcloud-b3-3341906:a4768094-3818-4e04-b55f-6ec7b34ffce3
2014-12-05 18:01:27.088 30461 INFO nova.compute.resource_tracker [-] Compute_service record created for devstack-trusty-hpcloud-b3-3341906:75fe1fde-6f4e-4b31-8b7c-7683d96e9b88

The next set of periodic tasks runs 1 minute later, and this time the nodes' resources are updated appropriately (512mb/1gb/1cpu):

http://logs.openstack.org/87/139687/1/check/check-tempest-dsvm-ironic-parallel-nv/55ce684/logs/screen-n-cpu.txt.gz#_2014-12-05_18_02_27_875

2014-12-05 18:02:27.958 30461 INFO nova.compute.resource_tracker [-] Compute_service record updated for devstack-trusty-hpcloud-b3-3341906:7048e6e1-e180-4295-8e0a-81eb0416a10a
2014-12-05 18:02:28.077 30461 INFO nova.compute.resource_tracker [-] Compute_service record updated for devstack-trusty-hpcloud-b3-3341906:a4768094-3818-4e04-b55f-6ec7b34ffce3
2014-12-05 18:02:28.162 30461 INFO nova.compute.resource_tracker [-] Compute_service record updated for devstack-trusty-hpcloud-b3-3341906:75fe1fde-6f4e-4b31-8b7c-7683d96e9b88

The first instance that tempest spawns fails just before the second periodic task sync. It looks like the initial resource sync picks up the nodes but does not update nova's resources, only the number of available hypervisors. Node properties (RAM/disk/CPU) are associated with nodes when they are enrolled, so that data must be reaching the nova side if it is updating its hypervisor count.
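
A quick client-side illustration of the gap (values as inferred from the logs above, not an actual capture): right after the first sync, the count row is already 3 while the resource rows are still 0, so a readiness check keyed on the count alone passes even though scheduling will fail.

    # Print the relevant rows of the hypervisor stats table; 'vcpus',
    # 'memory_mb' and 'local_gb' stay at 0 until the second sync.
    nova hypervisor-stats | awk '/ count | vcpus | memory_mb | local_gb /{ print $2, $4 }'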

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

If we really wanted to band-aid this, we could enhance devstack's resource polling to also wait for mem/cpu/gb to reach an expected state, though I wonder if there's something happening in Nova that's causing this. This seems like a new issue we're hitting, so perhaps something has changed recently on that front?
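
A sketch of that band-aid, extending the earlier illustrative helper to also gate on a resource column (again hypothetical, though the devstack change that eventually merged below polls CPUs in roughly this way):

    # Require both the hypervisor count and the reported vcpus to reach
    # their expected values before declaring the deployment ready.
    function wait_for_nova_resources {
        local expected_count=$1 expected_vcpus=$2
        local i count vcpus
        for i in $(seq 1 120); do
            count=$(nova hypervisor-stats | awk '/ count /{ print $4 }')
            vcpus=$(nova hypervisor-stats | awk '/ vcpus /{ print $4 }')
            if [[ $count -ge $expected_count && $vcpus -ge $expected_vcpus ]]; then
                return 0
            fi
            sleep 1
        done
        die $LINENO "Nova did not register $expected_count nodes with $expected_vcpus vcpus"
    }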

Changed in ironic:
status: New → Confirmed
status: Confirmed → New
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

OK, so we did some more digging here. Devananda caught that the host's ssh credentials for accessing local libvirt are created after the nodes are enrolled. Ironic can't validate the power state of the nodes until it can connect to libvirt, and nova won't take a node's resources into account until its power state has been validated, causing a delay before nodes become schedulable.
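
Schematically, in devstack's setup flow (function names taken from the xtrace excerpts later in this thread; exact call sites are approximate):

    # Broken order: keys are created after enrollment, so ironic cannot
    # validate power state and nova zeroes out the nodes' resources.
    #   enroll_nodes
    #   configure_ironic_auxiliary   # ssh key setup (too late)
    #
    # Fixed order: credentials first, then enrollment.
    configure_ironic_auxiliary   # ssh key setup for the libvirt host
    enroll_nodes                 # power state can now be validated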

Changed in devstack:
status: Fix Released → Confirmed
Changed in ironic:
status: New → Invalid
Changed in nova:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139770

Changed in devstack:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/139770
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=0354640587cde740aa0299c722f019ae1c01e05d
Submitter: Jenkins
Branch: master

commit 0354640587cde740aa0299c722f019ae1c01e05d
Author: Adam Gandelman <email address hidden>
Date: Fri Dec 5 16:49:12 2014 -0800

    Move ironic ssh key creation early in preparation

    SSH creds should be in place before nodes are enrolled. If not,
    ironic cannot sync power state, causing nova to skip nodes in
    its resource tracker.

    Change-Id: I6b98ae57ce33783f69e2cf9ba357807d384b3012
    Closes-bug: #1398128

Changed in devstack:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/140417

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

So the merged devstack patch does not seem to help much; we see the same failure, although ssh key creation has moved much earlier in the timeline:

# devstack
2014-12-09 18:13:42.602 | ++ configure_ironic_auxiliary # ssh key setup
2014-12-09 18:16:11.869 | ++ enroll_nodes
2014-12-09 18:16:36.082 | ++ '[' 3 -ge 3 ']' # nova reports hypervisor count=3

# Tempest first instance fail
Details: Server 6b7c6c7e-1e86-4f8f-a19f-8d93c350d2a0 failed to build and is in ERROR status
Details: {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'created': u'2014-12-09T18:17:20Z'}

# first n-cpu periodic to pick up node *count*
2014-12-09 18:16:34.322 30099 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks

# first n-cpu periodic to pick up node *resources*
2014-12-09 18:17:34.841 30099 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks

It looks as if the first periodic sync that picks up the enrolled nodes is still missing whatever parameters the driver needs to include the nodes' resources in nova's resource tracker. Some debug logging in Ironic would help here if it's proving to be this non-obvious.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Looks like the ironic-conductor's first post-enrollment periodic sync_power_state() run doesn't happen until 18:16:46, which is after nova's initial discovery of the nodes.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/140512

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/140512
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=0c99e2f65b6e86236c0d29928c110628f1e32f3d
Submitter: Jenkins
Branch: master

commit 0c99e2f65b6e86236c0d29928c110628f1e32f3d
Author: Adam Gandelman <email address hidden>
Date: Tue Dec 9 14:44:24 2014 -0800

    Poll resource tracker for ironic cpus as well as count

    When ironic nodes are enrolled, their resources are not available
    to the nova scheduler until after a round of ironic and nova periodic
    tasks have run. In addition to waiting for ironic nodes to show up in
    the resource tracker, also wait for associated CPU resources. In
    the worst case, this means waiting for 3 total rounds of periodic
    tasks.

    Change-Id: Idbbc43bf74ff5fff3d50f3494148454bb51e378f
    Closes-bug: #1398128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/141697

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (stable/juno)

Reviewed: https://review.openstack.org/140417
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=a6a31428575eb12e6ff58c64e6047cb0c1498899
Submitter: Jenkins
Branch: stable/juno

commit a6a31428575eb12e6ff58c64e6047cb0c1498899
Author: Adam Gandelman <email address hidden>
Date: Fri Dec 5 16:49:12 2014 -0800

    Move ironic ssh key creation early in preparation

    SSH creds should be in place before nodes are enrolled. If not,
    ironic cannot sync power state, causing nova to skip nodes in
    its resource tracker.

    Closes-bug: #1398128
    (cherry picked from commit 0354640587cde740aa0299c722f019ae1c01e05d)

    Conflicts:
     lib/ironic

    Change-Id: I6b98ae57ce33783f69e2cf9ba357807d384b3012

tags: added: in-stable-juno
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/141697
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=67d547c9bbd8a7568ac2250deca46d7cf5ee3995
Submitter: Jenkins
Branch: stable/juno

commit 67d547c9bbd8a7568ac2250deca46d7cf5ee3995
Author: Adam Gandelman <email address hidden>
Date: Tue Dec 9 14:44:24 2014 -0800

    Poll resource tracker for ironic cpus as well as count

    When ironic nodes are enrolled, their resources are not available
    to the nova scheduler until after a round of ironic and nova periodic
    tasks have run. In addition to waiting for ironic nodes to show up in
    the resource tracker, also wait for associated CPU resources. In
    the worst case, this means waiting for 3 total rounds of periodic
    tasks.

    Closes-bug: #1398128
    (cherry picked from commit 0c99e2f65b6e86236c0d29928c110628f1e32f3d)

    Conflicts:
     lib/ironic

    Change-Id: Idbbc43bf74ff5fff3d50f3494148454bb51e378f

Revision history for this message
Matt Riedemann (mriedem) wrote :

Opened a new bug 1445917 for the new failure.
