CI jobs failing with "Message: No valid host was found. There are not enough hosts available., Code: 500"

Bug #1599858 reported by Derek Higgins on 2016-07-07
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
High
Derek Higgins

Bug Description

Ever since the move to rh2, we've been seeing a lot of jobs failing (not all but too many) with

2016-07-07 11:42:39.204687 | 2016-07-07 11:42:26 [overcloud]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.Compute.resources[0].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

Tags: ci Edit Tag help
Derek Higgins (derekh) wrote :
Download full text (4.9 KiB)

From postci log, one of the node failed to deploy e.g.
http://logs.openstack.org/34/335434/3/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/085ed2a/logs/postci.txt.gz

+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| 9a37c72f-eed2-4dfe-a6ef-11d336be716b | overcloud-controller-0 | ACTIVE | - | Running | ctlplane=192.0.2.19 |
| b49a7770-c4db-44db-a822-7b3ee631b58a | overcloud-controller-1 | ACTIVE | - | Running | ctlplane=192.0.2.13 |
| 233d96e9-5ff4-4e16-b730-5cb740d10d51 | overcloud-controller-2 | ACTIVE | - | Running | ctlplane=192.0.2.20 |
| 6eabef07-f41d-4d81-92ae-59da36e828f0 | overcloud-novacompute-0 | ERROR | - | NOSTATE | |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

The times variuos nodes were created

11:33:54 979f33b0-0df2-4565-b475-25c87296af78
11:34:18 b49a7770-c4db-44db-a822-7b3ee631b58a OK
11:34:28 9a37c72f-eed2-4dfe-a6ef-11d336be716b OK
11:34:36 233d96e9-5ff4-4e16-b730-5cb740d10d51 OK
11:36:17 3bee4016-f6b8-4239-853e-b36d5955667c
11:36:32 37cfeee2-d694-4a39-8421-e84348bd6906
11:36:49 393d0386-c9e1-4549-9917-6803676497a6
11:37:23 d9965837-28fa-4cf9-8d1c-86c656f7810b
11:38:15 6eabef07-f41d-4d81-92ae-59da36e828f0 ERROR

The very first on in this case failed, the nova logs show ironic waiting for some time

nova/nova-compute.log:2016-07-07 11:35:43.132 3716 DEBUG nova.virt.ironic.driver [-] [instance: 979f33b0-0df2-4565-b475-25c87296af78] Still waiting for ironic node 2ebfe30e-df9b-4383-81d1-61b5c3443810 to become ACTIVE: power_state="power off", target_power_state=None, provision_state="deploying", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:118

until eventually
nova/nova-compute.log:2016-07-07 11:35:45.146 3716 ERROR oslo.service.loopingcall InstanceDeployFailure: Failed to provision instance 979f33b0-0df2-4565-b475-25c87296af78: Timeout reached while waiting for callback for node 2ebfe30e-df9b-4383-81d1-61b5c3443810

After this all of the retries (from heat?) fail
grep "start: 4, end: 0" nova/nova-scheduler.log
2016-07-07 11:36:18.069 1929 INFO nova.filters [req-6e2bcd70-d5e0-4ffe-af60-6bbd9a227eee 620cdc7716ad4b199de8315d3e3199b7 57bbfbf0ad5b4d46af40c681bbf66da9 - - -] Filtering removed all hosts for the request with instance ID '3bee4016-f6b8-4239-853e-b36d5955667c'. Filter results: ['RetryFilter: (start: 4, end: 4)', 'TripleOCapabilitiesFilter: (start: 4, end: 4)', 'ComputeCapabilitiesFilter: (start: 4, end: 4)', 'AvailabilityZoneFilter: (start: 4, end: 4)', 'RamFilter: (start: 4, end: 0)']
2016-07-07 11:36:32.717 1929 INFO nova.filters [req-8f78197c-1676-4d41-9004-6514106914e5 620cdc7716ad4b199de8315d3e3199b7 57bbfbf0ad5b4d46af40c681bbf66da9 - - -] Filtering removed all hosts for...

Read more...

Derek Higgins (derekh) wrote :

I believe whats going on here is that ARP traffic from one CI test is leaking into the tenant provision network of another CI test on the same cloud (rh2).

When the IPA image boots into the node, dhclient get an IP via dhcp,
One check dhcient does before assigning the IP address to the interface is send out an ARP request for the IP to make sure there are no collisions.

In the error case a node on another tenant network is responding to the ARP request and the deploy image doesn't get assigned the IP and just waits until nova/ironic times out.

I can't confirm if this is happening during CI or not, we would need access to the running environments that we don't have but I have come across it happening on an early deployment on rh2, I never got to the bottom of exactly what corner case triggers the problem.

I'm going to submit a patch to tripleo-ci so that each CI job uses a different subnet in its cloud, this should lower the probability of triggering the scenario above. If this improves the problem then it will indicate we are on the right track.

Steven Hardy (shardy) on 2016-07-08
Changed in tripleo:
milestone: none → ongoing
assignee: nobody → Derek Higgins (derekh)
Emilien Macchi (emilienm) wrote :

This bug was last updated over 180 days ago, as tripleo is a fast moving project and we'd like to get the tracker down to currently actionable bugs, this is getting marked as Invalid. If the issue still exists, please feel free to reopen it.

Changed in tripleo:
status: Confirmed → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers