multinode jobs failing on gathering facts from subnode-2

Bug #1810054 reported by Alex Schultz
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Triaged | Critical | Ronelle Landy |

Bug Description

http://logs.openstack.org/92/625692/2/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/b38e0b1/job-output.txt.gz#_2018-12-29_20_27_24_472604

2018-12-29 20:27:23.969381 | primary | TASK [tripleo-inventory : include_tasks] ***************************************
2018-12-29 20:27:24.377851 | primary | Saturday 29 December 2018 20:27:24 +0000 (0:00:00.502) 0:00:11.230 *****
2018-12-29 20:27:24.417372 | primary | skipping: [undercloud]
2018-12-29 20:27:24.472440 | primary |
2018-12-29 20:27:24.472604 | primary | PLAY [Create configs on subnodes] **********************************************
2018-12-29 20:27:24.561679 | primary |
2018-12-29 20:27:24.561831 | primary | TASK [Gathering Facts] *********************************************************
2018-12-29 20:27:24.595243 | primary | Saturday 29 December 2018 20:27:24 +0000 (0:00:00.217) 0:00:11.447 *****
2018-12-29 20:27:40.293283 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2018-12-29 20:27:40.293401 | primary | "changed": false,
2018-12-29 20:27:40.293461 | primary | "unreachable": true
2018-12-29 20:27:40.293495 | primary | }

http://logs.openstack.org/93/599593/21/gate/tripleo-ci-centos-7-containers-multinode/5fcf3e3/job-output.txt.gz#_2018-12-29_20_38_19_321735

2018-12-29 20:38:18.801509 | primary | TASK [tripleo-inventory : include_tasks] ***************************************
2018-12-29 20:38:19.063035 | primary | Saturday 29 December 2018 20:38:19 +0000 (0:00:00.356) 0:00:16.397 *****
2018-12-29 20:38:19.113189 | primary | skipping: [undercloud]
2018-12-29 20:38:19.176758 | primary |
2018-12-29 20:38:19.176956 | primary | PLAY [Create configs on subnodes] **********************************************
2018-12-29 20:38:19.321591 | primary |
2018-12-29 20:38:19.321735 | primary | TASK [Gathering Facts] *********************************************************
2018-12-29 20:38:19.430729 | primary | Saturday 29 December 2018 20:38:19 +0000 (0:00:00.367) 0:00:16.765 *****
2018-12-29 20:38:34.655193 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2018-12-29 20:38:34.655341 | primary | "changed": false,
2018-12-29 20:38:34.655404 | primary | "unreachable": true
2018-12-29 20:38:34.655440 | primary | }

Tags: alert ci
Alex Schultz (alex-schultz) wrote :
Ronelle Landy (rlandy)
Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
Ronelle Landy (rlandy) wrote :

Looking at the logs, the errors occur when the host is: centos-7-limestone-regionone-*

Ronelle Landy (rlandy) wrote :

Latest failure:
2019-01-02 15:57:12.946423 | localhost | Hostname: centos-7-limestone-regionone-0001456752
2019-01-02 16:03:11.725793 | primary | fatal: [subnode-2]: UNREACHABLE! => {

http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/job-output.txt.gz#_2019-01-02_16_03_11_725793
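A minimal sketch for checking the provider correlation across several job logs, assuming the `<timestamp> | <node> | <message>` layout seen in the excerpts in this report. The function names `summarize_job_output` and `provider_of` are made up for illustration and are not part of any CI tooling.

```python
import re
from typing import Iterable, Optional

# Layout seen in this report: "<date> <time> | <node> | <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \| (\S+) \| ?(.*)$")
HOSTNAME_RE = re.compile(r"^Hostname: (\S+)$")
UNREACHABLE_RE = re.compile(r"^fatal: \[([^\]]+)\]: UNREACHABLE!")


def summarize_job_output(lines: Iterable[str]) -> dict:
    """Return the first reported hostname and every host marked UNREACHABLE."""
    hostname: Optional[str] = None
    unreachable = []
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # not a job-output line in the assumed layout
        message = match.group(3)
        host = HOSTNAME_RE.match(message)
        if host and hostname is None:
            hostname = host.group(1)
        failed = UNREACHABLE_RE.match(message)
        if failed:
            unreachable.append(failed.group(1))
    return {"hostname": hostname, "unreachable": unreachable}


def provider_of(hostname: str) -> str:
    """Drop the trailing numeric node id, e.g. '-0001456752'."""
    return re.sub(r"-\d+$", "", hostname)
```

Running `summarize_job_output` over a downloaded job-output.txt and grouping the results by `provider_of(...)` would show whether the UNREACHABLE failures cluster on one provider.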

Ronelle Landy (rlandy) wrote :

Looking into whether we can exclude certain providers for multinode jobs.

David Moreau Simard (dmsimard) wrote :

This can be identified in logstash with this query:
message:"fatal: [subnode-2]: UNREACHABLE!"

See this link (increase the duration to 7 days at the top right):
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22fatal%3A%20%5Bsubnode-2%5D%3A%20UNREACHABLE!%5C%22
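For reference, a small sketch of how a link like the one above can be built for other messages. It only mirrors the encoding visible in that link (backslash-escaped quotes around the phrase, `!` left unencoded); the helper name `logstash_link` is hypothetical.

```python
from urllib.parse import quote

DASHBOARD = "http://logstash.openstack.org/#dashboard/file/logstash.json"


def logstash_link(message: str) -> str:
    """Build a dashboard link that searches for an exact message phrase."""
    # The link in this comment wraps the phrase in backslash-escaped quotes.
    query = 'message:\\"' + message + '\\"'
    # Keep "!" literal so the output matches the link in this comment.
    return DASHBOARD + "?query=" + quote(query, safe="!")


print(logstash_link("fatal: [subnode-2]: UNREACHABLE!"))
```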

David Moreau Simard (dmsimard) wrote :

See screenshot for posterity.

wes hayutin (weshayutin) wrote :

2019-01-02 16:02:56.441067 | primary | PLAY [Create configs on subnodes] **********************************************
2019-01-02 16:02:56.509946 | primary |
2019-01-02 16:02:56.510092 | primary | TASK [Gathering Facts] *********************************************************
2019-01-02 16:02:56.551647 | primary | Wednesday 02 January 2019 16:02:56 +0000 (0:00:00.181) 0:00:13.974 *****
2019-01-02 16:03:11.725793 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2019-01-02 16:03:11.725914 | primary | "changed": false,
2019-01-02 16:03:11.725968 | primary | "unreachable": true
2019-01-02 16:03:11.725998 | primary | }
2019-01-02 16:03:11.726026 | primary |
2019-01-02 16:03:11.726056 | primary | MSG:
2019-01-02 16:03:11.726083 | primary |
2019-01-02 16:03:11.726204 | primary | SSH Error: data could not be sent to remote host "10.4.70.74". Make sure this host can be reached over ssh

http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/job-output.txt.gz#_2019-01-02_16_02_56_441067

gate failure :(

David Moreau Simard (dmsimard) wrote :

It's worth mentioning that TripleO jobs do not have a 100% failure rate on limestone.
In fact, I found a pattern when comparing failed jobs to successful ones.

Limestone exposes two kinds of CPUs:
- Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
- Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)

From the several failures I've looked at, they all used the E3-12xx variant. I was not able to find failures that used the E5-2650 CPU. This could be a coincidence, though.

We've discussed this in #openstack-infra [1] and Logan from Limestone was able to identify that some compute nodes had nova configured to use "host-model" as the CPU mode, which ended up being exposed as the E3-12xx CPU.
Compute nodes that were correctly configured to use "host-passthrough" exposed the E5-2650 CPU.

It's possible that this may have caused issues if TripleO attempted to set up nested virtualization [2].
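One way to check this on a test node is to look at what the guest actually sees in /proc/cpuinfo. The sketch below reports the CPU model name and whether a hardware virtualization flag (`vmx` on Intel, `svm` on AMD) is present. Whether the E3-12xx guests lacked that flag is not established in this bug, so treat this as a diagnostic aid only; `inspect_cpuinfo` is a hypothetical helper name.

```python
from typing import Dict


def inspect_cpuinfo(text: str) -> Dict[str, object]:
    """Extract the first CPU model name and the hardware-virt flag state."""
    model = None
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    # vmx (Intel) or svm (AMD) must be visible for nested KVM to work.
    return {"model": model, "hw_virt": bool(flags & {"vmx", "svm"})}


# Usage on a Linux node:
#   with open("/proc/cpuinfo") as handle:
#       print(inspect_cpuinfo(handle.read()))
```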

Every compute node in Limestone should now properly use "host-passthrough" and I've sent a review to add an elastic-recheck query for this bug: https://review.openstack.org/#/c/628034/

[1]: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2019-01-02.log.html#t2019-01-02T19:50:28
[2]: https://github.com/openstack/tripleo-quickstart-extras/blob/49f22ec31af603010c0cbe2cefd86cbc751768de/playbooks/multinode-undercloud.yml#L88

Ronelle Landy (rlandy) wrote :

No occurrences of the error message:"fatal: [subnode-2]: UNREACHABLE!" since 11pm on 01/02.
Will continue to watch this.

David Moreau Simard (dmsimard) wrote :

This is now tracked in elastic-recheck: http://status.openstack.org/elastic-recheck/#1810054

summary: - mulitnode jobs failing on gathering facts from subnode-2
+ multinode jobs failing on gathering facts from subnode-2