CI: jobs failing with "Message: No valid host was found. There are not enough hosts available., Code: 500"

Bug #1585641 reported by James Slagle
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: James Slagle
Milestone: newton-2

Bug Description

Seeing lots of CI jobs (but not all) failing with errors similar to:
2016-05-25 12:31:09.983 | 2016-05-25 12:30:58 [overcloud]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.CephStorage.resources[0].resources.CephStorage: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

Example failed job:
http://logs.openstack.org/35/320035/2/check-tripleo/gate-tripleo-ci-f22-nonha/20c2908/

Revision history for this message
James Slagle (james-slagle) wrote :
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Revision history for this message
James Slagle (james-slagle) wrote :

In this failed job:
http://logs.openstack.org/35/320035/2/check-tripleo/gate-tripleo-ci-f22-ha/8535e54/

I'm seeing these errors in /var/log/neutron/openvswitch-agent.log on the undercloud:

2016-05-25 12:05:19.841 19427 DEBUG neutron.agent.linux.utils [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Running command (rootwrap daemon): ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=ofport', 'list', 'Interface', 'int-br-ctlplane'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2016-05-25 12:05:19.854 19427 ERROR neutron.agent.ovsdb.impl_vsctl [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=ofport', 'list', 'Interface', 'int-br-ctlplane']. Exception: Exit code: 1; Stdin: ; Stdout: ; Stderr: ovs-vsctl: no row "int-br-ctlplane" in table Interface

2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Timed out retrieving ofport on port int-br-ctlplane.
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib Traceback (most recent call last):
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 293, in get_port_ofport
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib ofport = self._get_port_ofport(port_name)
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 92, in wrapped
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib return new_fn(*args, **kwargs)
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib File "/usr/lib/python2.7/site-packages/retrying.py", line 68, in wrapped_f
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib return Retrying(*dargs, **dkw).call(f, *args, **kw)
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib File "/usr/lib/python2.7/site-packages/retrying.py", line 231, in call
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib raise RetryError(attempt)
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib RetryError: RetryError[Attempts: 16, Value: None]
2016-05-25 12:05:19.856 19427 ERROR neutron.agent.common.ovs_lib
2016-05-25 12:05:19.875 19427 DEBUG neutron.agent.linux.utils [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Running command (rootwrap daemon): ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--may-exist', 'add-port', 'br-int', 'int-br-ctlplane', '--', 'set', 'Interface', 'int-br-ctlplane', 'type=patch', 'options:peer=nonexistent-peer'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2016-05-25 12:05:19.889 19427 DEBUG neutron.agent.linux.utils [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:142
2016-05-25 12:05:19.890 19427 DEBUG neutron.agent.linux.utils [req-93b1ef19-f4f0-41bd-b9b7-030b97db39a9 - - - - -] Run...
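
For reference, the ofport lookup the agent keeps retrying above can be rerun by hand. A minimal sketch in Python (same ovs-vsctl arguments as in the log; assumes it is run as root on the undercloud):

# Rerun the ofport lookup from the agent log. A non-zero exit with
# 'no row ... in table Interface' means the int-br-ctlplane patch port
# has not been created yet.
import subprocess

cmd = ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json',
       '--', '--columns=ofport', 'list', 'Interface', 'int-br-ctlplane']
try:
    print(subprocess.check_output(cmd).decode())
except subprocess.CalledProcessError as exc:
    print('lookup failed (port likely missing): %s' % exc)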


Revision history for this message
James Slagle (james-slagle) wrote :

OK, I'm not sure the above openvswitch error is actually related, as it goes away once the rest of the neutron servers come up, and you don't see it during the actual deployment.

Revision history for this message
James Slagle (james-slagle) wrote :

I do see iSCSI errors in the ironic-conductor log:
2016-05-25 12:25:33.717 16490 DEBUG ironic.common.utils [req-f70acdab-b9b7-4622-916b-43b949354588 - - - - -] Command stderr is: "iscsiadm: invalid error code 65280
iscsiadm: Could not execute operation on all sessions: (null)
" execute /usr/lib/python2.7/site-packages/ironic/common/utils.py:92

After that happens, it looks like nova tries to reschedule the instances, but all nodes are still seen as in use, with not enough memory available.
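
A quick way to confirm that is to dump per-node memory as nova sees it. A minimal sketch using python-novaclient (assumes the usual undercloud OS_* credentials are set in the environment):

# List per-hypervisor (i.e. per ironic node) memory as reported to nova.
# A node with running_vms=0 but no free memory (or memory_mb=0) will be
# skipped by the scheduler when nova tries to reschedule.
import os
from novaclient import client as nova_client

nova = nova_client.Client('2',
                          os.environ['OS_USERNAME'],
                          os.environ['OS_PASSWORD'],
                          os.environ['OS_TENANT_NAME'],
                          os.environ['OS_AUTH_URL'])

for hyp in nova.hypervisors.list():
    print('%s: memory_mb=%s memory_mb_used=%s running_vms=%s' % (
        hyp.hypervisor_hostname, hyp.memory_mb,
        hyp.memory_mb_used, hyp.running_vms))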

Revision history for this message
James Slagle (james-slagle) wrote :

FWIW, this appears to be the first occurrence of the error:
http://logs.openstack.org/31/316531/1/check-tripleo/gate-tripleo-ci-f22-nonha/27a01c6/

Revision history for this message
James Slagle (james-slagle) wrote :

The previous delorean repo we were using, before the error started showing up, was:
http://trunk.rdoproject.org/centos7/96/73/9673e634966df3792d4103e5fd7ac156cd9ba677_8d5bf55b/

So:

Good repo: http://trunk.rdoproject.org/centos7/96/73/9673e634966df3792d4103e5fd7ac156cd9ba677_8d5bf55b/
Bad repo: http://trunk.rdoproject.org/centos7/39/d0/39d0fdf0de72a3e76716948497b2b5009d1c64b6_8df7385a/

To see the ironic commits between these two repos, use:
git log c28a95a..6cd2f21

To see the nova commits between these two repos, use:
git log 38033b9..97c05f5

I'm not saying I have proof it was definitely one of those commits, but it seems likely.

Revision history for this message
James Slagle (james-slagle) wrote :

I'm testing the older ironic package from the DLRN repo where we didn't see the issue, in this commit:
https://review.openstack.org/#/c/321113/

Figured we could recheck that a few times and see whether we hit the issue or not.

Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
Revision history for this message
James Slagle (james-slagle) wrote :

That CI test did eventually hit the error, so it's probably not a commit in Ironic. Focusing on the commits in Nova, it looks like this might be the problem:
https://review.openstack.org/#/c/306670/

That change in behavior corresponds exactly with what I'm seeing in the job logs: some of the nodes start reporting memory_mb=0 and memory_mb_used=0, making them unavailable for scheduling from the RamFilter's perspective even though no instances are assigned to them.
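
For context, the check that rejects those nodes boils down to a free-RAM comparison. A simplified sketch of RamFilter-style logic (not the actual nova code) showing why a node reporting memory_mb=0 fails for any flavor that requests RAM:

# Simplified RamFilter-style check: a node reporting memory_mb=0 has no
# usable RAM, so any flavor requesting memory is rejected even though
# nothing is running on the node.
def ram_filter_passes(memory_mb, memory_mb_used, requested_ram_mb,
                      ram_allocation_ratio=1.0):
    usable_mb = memory_mb * ram_allocation_ratio - memory_mb_used
    return usable_mb >= requested_ram_mb

print(ram_filter_passes(8192, 0, 4096))  # healthy node -> True
print(ram_filter_passes(0, 0, 4096))     # regressed node -> False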

Going to try the CI test again with the downgraded nova package and see if I'm able to reproduce it.

Revision history for this message
James Slagle (james-slagle) wrote :

This has been fixed by this nova patch, which was a partial revert: https://review.openstack.org/#/c/326100/

Changed in tripleo:
milestone: none → newton-2
status: Triaged → Fix Committed
Steven Hardy (shardy)
Changed in tripleo:
status: Fix Committed → Fix Released