OpenStack-Gate

Zuul v3 tasks can end up in an UNREACHABLE state

Bug #1721093 reported by David Moreau Simard on 2017-10-03

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack-Gate	Confirmed	Undecided	Unassigned

Bug Description

For reasons unknown, a large number of Zuul v3 tasks end up in an UNREACHABLE state, as in the following job:

http://logs.openstack.org/60/508660/13/check/legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet/7eaaf51/job-output.txt.gz

2017-10-03 12:50:31.321993 | fatal: [secondary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.193\". Make sure this host can be reached over ssh", "unreachable": true}
2017-10-03 12:50:31.322013 | fatal: [primary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.191\". Make sure this host can be reached over ssh", "unreachable": true}
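
To make records like the two above easier to count per node, here is a minimal Python sketch that pulls the node name and remote IP out of a one-line UNREACHABLE record. The function name `parse_unreachable` and the regular expression are illustrative assumptions based only on the two lines quoted above; they are not part of Zuul or elastic-recheck.

```python
import json
import re

# Assumed layout, inferred from the two log lines quoted in this report:
#   <date> <time> | fatal: [<node>]: UNREACHABLE! => <one-line JSON>
UNREACHABLE_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) \| fatal: \[(?P<node>[^\]]+)\]: "
    r"UNREACHABLE! => (?P<payload>\{.*\})$"
)


def parse_unreachable(line):
    """Return a dict describing a one-line UNREACHABLE record, or None."""
    match = UNREACHABLE_RE.match(line.strip())
    if match is None:
        return None
    payload = json.loads(match.group("payload"))
    # The remote address is only available inside the free-text message.
    ip_match = re.search(r'remote host "([^"]+)"', payload.get("msg", ""))
    return {
        "timestamp": match.group("timestamp"),
        "node": match.group("node"),
        "ip": ip_match.group(1) if ip_match else None,
        "unreachable": payload.get("unreachable", False),
    }


# One of the lines from the job output linked above.
SAMPLE = r'2017-10-03 12:50:31.321993 | fatal: [secondary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.193\". Make sure this host can be reached over ssh", "unreachable": true}'

if __name__ == "__main__":
    print(parse_unreachable(SAMPLE))
```

This only handles the one-line JSON form shown here; a record spread over several lines would not match.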

Let's track this in elastic recheck to get a better understanding of the issue.

This may be related to https://bugs.launchpad.net/openstack-gate/+bug/1718197

Matt Riedemann (mriedem) wrote on 2017-12-22:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Make%20sure%20this%20host%20can%20be%20reached%20over%20ssh%5C%5C%5C%22%2C%20%5C%5C%5C%22unreachable%5C%5C%5C%22%3A%20true%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Changed in openstack-gate:
status:	New → Confirmed

Paul Belanger (pabelanger) wrote on 2018-02-27:

This is actually the result of zuul.o.o losing access to ZooKeeper (nodepool.o.o), after which nodepool-launcher sees the locks on the ZooKeeper requests being removed.

nodepool-launcher then deletes all the nodes in the gate out from under the running jobs, and Ansible can no longer SSH into those nodes (because they are gone).

The fix is to move to the ZooKeeper cluster (zk01 / zk02 / zk03) and update zuul.o.o and nodepool-launcher to use it.

Sorin Sbarnea (ssbarnea) wrote on 2018-12-05:

I think the split query broke when we started printing Ansible results as formatted JSON instead of one-line JSON.

It fails to spot this: http://logs.openstack.org/04/618604/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/e892675/job-output.txt.gz#_2018-11-29_12_02_57_407169

On the other hand, if we looked for "POST-RUN END RESULT_UNREACHABLE", we could be sure to match it.

This makes me believe that we should improve the current query at:
https://github.com/openstack-infra/elastic-recheck/blob/master/queries/1721093.yaml
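
To illustrate why the existing phrase stops matching, here is a rough Python sketch. It assumes each console line is indexed as its own message and approximates the Elasticsearch phrase query with a plain substring test, which is a simplification. The multi-line sample is a hypothetical rendering of formatted JSON output, not copied from a real job log, and the helper names are made up for this example.

```python
# Phrase taken from the logstash query linked earlier in this report.
PHRASE = 'Make sure this host can be reached over ssh", "unreachable": true'
# Marker proposed in the comment above.
MARKER = "POST-RUN END RESULT_UNREACHABLE"

# One-line JSON result, as in the original bug description.
ONE_LINE_JOB = [
    r'fatal: [primary]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"158.69.83.191\". Make sure this host can be reached over ssh", "unreachable": true}',
]

# Hypothetical formatted-JSON rendering of the same result (assumed layout),
# followed by a line that carries the proposed marker.
FORMATTED_JOB = [
    "fatal: [primary]: UNREACHABLE! => {",
    '    "changed": false,',
    r'    "msg": "SSH Error: data could not be sent to remote host \"158.69.83.191\". Make sure this host can be reached over ssh",',
    '    "unreachable": true',
    "}",
    "POST-RUN END RESULT_UNREACHABLE",
]


def phrase_hits(lines):
    """True if any single line contains the whole phrase."""
    return any(PHRASE in line for line in lines)


def marker_hits(lines):
    """True if any single line contains the RESULT_UNREACHABLE marker."""
    return any(MARKER in line for line in lines)


if __name__ == "__main__":
    print("one-line job, phrase:", phrase_hits(ONE_LINE_JOB))
    print("formatted job, phrase:", phrase_hits(FORMATTED_JOB))
    print("formatted job, marker:", marker_hits(FORMATTED_JOB))
```

In this sketch the phrase spans two lines once the JSON is formatted, so no single message contains it, while the marker still sits on one line.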
