Zuul v3 tasks can end up in an UNREACHABLE state

Bug #1721093 reported by David Moreau Simard
This bug affects 3 people
Affects: OpenStack-Gate
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

For reasons unknown, a high number of Zuul v3 tasks end up in an unreachable state, like the following:

http://logs.openstack.org/60/508660/13/check/legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet/7eaaf51/job-output.txt.gz

2017-10-03 12:50:31.321993 | fatal: [secondary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.193\". Make sure this host can be reached over ssh", "unreachable": true}
2017-10-03 12:50:31.322013 | fatal: [primary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.191\". Make sure this host can be reached over ssh", "unreachable": true}

Let's track this in elastic recheck to get a better understanding of the issue.

This may be related to https://bugs.launchpad.net/openstack-gate/+bug/1718197

Matt Riedemann (mriedem) wrote :
Changed in openstack-gate:
status: New → Confirmed
Paul Belanger (pabelanger) wrote :

This is actually the result of zuul.o.o losing access to ZooKeeper (nodepool.o.o), after which nodepool-launcher sees the locks on the ZooKeeper requests being removed.

nodepool-launcher then deletes all the nodes in the gate out from under the running jobs, and Ansible can no longer SSH into those nodes (because they are gone).

The fix is to move to the ZooKeeper cluster (zk01 / zk02 / zk03) and update zuul.o.o and nodepool-launcher to use it.
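
To make the failure mode described in this comment concrete, here is a toy model in Python. It is only a sketch: `ToyZooKeeper` and `launcher_cleanup` are hypothetical names made up for illustration, not Zuul, Nodepool, or ZooKeeper APIs, and the real launcher logic is certainly more involved. It assumes only what the comment states: the locks vanish when the scheduler loses its ZooKeeper connection, and the launcher deletes nodes whose locks are gone.

```python
class ToyZooKeeper:
    """Toy stand-in for ZooKeeper: each lock is tied to the session that took it."""

    def __init__(self):
        self.locks = {}  # node name -> session holding the lock

    def lock(self, node, session):
        self.locks[node] = session

    def expire_session(self, session):
        # When a session is lost, every lock it held disappears.
        self.locks = {n: s for n, s in self.locks.items() if s != session}


def launcher_cleanup(zk, nodes):
    """Toy launcher pass: delete every node that no longer has a lock."""
    deleted = [n for n in nodes if n not in zk.locks]
    for n in deleted:
        nodes.remove(n)
    return deleted


zk = ToyZooKeeper()
nodes = ["primary", "secondary"]
for name in nodes:
    zk.lock(name, session="zuul-session")

before = launcher_cleanup(zk, nodes)   # locks held, nothing to delete
zk.expire_session("zuul-session")      # scheduler loses its connection
after = launcher_cleanup(zk, nodes)    # nodes deleted under the running job
print(before, after, nodes)
```

In this model the job itself never fails; its nodes simply disappear, which is consistent with Ansible reporting both hosts as UNREACHABLE at the same moment.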

Sorin Sbarnea (ssbarnea) wrote :

I think the split query broke when Ansible results started being printed as formatted (multi-line) JSON instead of one-line JSON.

It fails to spot this: http://logs.openstack.org/04/618604/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/e892675/job-output.txt.gz#_2018-11-29_12_02_57_407169

On the other hand, if we looked for "POST-RUN END RESULT_UNREACHABLE" we could be sure to match it.

This makes me believe that we should improve the current query from:
https://github.com/openstack-infra/elastic-recheck/blob/master/queries/1721093.yaml
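
As a rough illustration of the point above, the Python sketch below compares a one-line matcher with a search for the "POST-RUN END RESULT_UNREACHABLE" marker. It does not reproduce the actual elastic-recheck query. The helper names, the regular expression, and the multi-line sample log (including the text around the marker) are assumptions made up for this example; only the one-line sample is taken from the bug description.

```python
import re

# Assumed pattern: the whole Ansible result must sit on a single log line.
ONE_LINE = re.compile(
    r'fatal: \[(?P<host>[^\]]+)\]: UNREACHABLE! => \{.*"unreachable": true'
)
POST_RUN_MARKER = "POST-RUN END RESULT_UNREACHABLE"


def hosts_matched_one_line(lines):
    """Return hosts whose UNREACHABLE result fits on a single line."""
    hosts = []
    for line in lines:
        match = ONE_LINE.search(line)
        if match:
            hosts.append(match.group("host"))
    return hosts


def has_post_run_marker(lines):
    """Return True if any line carries the post-run UNREACHABLE marker."""
    return any(POST_RUN_MARKER in line for line in lines)


# One-line form, as quoted in the bug description.
one_line_log = [
    '2017-10-03 12:50:31.321993 | fatal: [secondary]: UNREACHABLE! => '
    '{"changed": false, "msg": "SSH Error: data could not be sent to remote '
    'host \\"158.69.83.193\\". Make sure this host can be reached over ssh", '
    '"unreachable": true}',
]

# Illustrative multi-line form (made up, not copied from the linked job).
formatted_log = [
    "fatal: [primary]: UNREACHABLE! => {",
    '    "changed": false,',
    '    "msg": "SSH Error: data could not be sent to remote host",',
    '    "unreachable": true',
    "}",
    "POST-RUN END RESULT_UNREACHABLE",
]

print(hosts_matched_one_line(one_line_log))
print(hosts_matched_one_line(formatted_log))
print(has_post_run_marker(formatted_log))
```

The one-line pattern finds nothing in the formatted sample because the opening brace and the "unreachable" key land on different lines, while the marker search does not depend on how the JSON is laid out.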
