multinode jobs failing on gathering facts from subnode-2

Bug #1810054 reported by Alex Schultz on 2018-12-29
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Ronelle Landy

Bug Description

http://logs.openstack.org/92/625692/2/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/b38e0b1/job-output.txt.gz#_2018-12-29_20_27_24_472604

2018-12-29 20:27:23.969381 | primary | TASK [tripleo-inventory : include_tasks] ***************************************
2018-12-29 20:27:24.377851 | primary | Saturday 29 December 2018 20:27:24 +0000 (0:00:00.502) 0:00:11.230 *****
2018-12-29 20:27:24.417372 | primary | skipping: [undercloud]
2018-12-29 20:27:24.472440 | primary |
2018-12-29 20:27:24.472604 | primary | PLAY [Create configs on subnodes] **********************************************
2018-12-29 20:27:24.561679 | primary |
2018-12-29 20:27:24.561831 | primary | TASK [Gathering Facts] *********************************************************
2018-12-29 20:27:24.595243 | primary | Saturday 29 December 2018 20:27:24 +0000 (0:00:00.217) 0:00:11.447 *****
2018-12-29 20:27:40.293283 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2018-12-29 20:27:40.293401 | primary | "changed": false,
2018-12-29 20:27:40.293461 | primary | "unreachable": true
2018-12-29 20:27:40.293495 | primary | }

http://logs.openstack.org/93/599593/21/gate/tripleo-ci-centos-7-containers-multinode/5fcf3e3/job-output.txt.gz#_2018-12-29_20_38_19_321735

2018-12-29 20:38:18.801509 | primary | TASK [tripleo-inventory : include_tasks] ***************************************
2018-12-29 20:38:19.063035 | primary | Saturday 29 December 2018 20:38:19 +0000 (0:00:00.356) 0:00:16.397 *****
2018-12-29 20:38:19.113189 | primary | skipping: [undercloud]
2018-12-29 20:38:19.176758 | primary |
2018-12-29 20:38:19.176956 | primary | PLAY [Create configs on subnodes] **********************************************
2018-12-29 20:38:19.321591 | primary |
2018-12-29 20:38:19.321735 | primary | TASK [Gathering Facts] *********************************************************
2018-12-29 20:38:19.430729 | primary | Saturday 29 December 2018 20:38:19 +0000 (0:00:00.367) 0:00:16.765 *****
2018-12-29 20:38:34.655193 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2018-12-29 20:38:34.655341 | primary | "changed": false,
2018-12-29 20:38:34.655404 | primary | "unreachable": true
2018-12-29 20:38:34.655440 | primary | }

Ronelle Landy (rlandy) on 2019-01-02
Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
Ronelle Landy (rlandy) wrote :

Looking at the logs, the errors occur when the host is: centos-7-limestone-regionone-*

Ronelle Landy (rlandy) wrote :

Latest failure:
2019-01-02 15:57:12.946423 | localhost | Hostname: centos-7-limestone-regionone-0001456752
2019-01-02 16:03:11.725793 | primary | fatal: [subnode-2]: UNREACHABLE! => {

http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/job-output.txt.gz#_2019-01-02_16_03_11_725793
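For anyone repeating this correlation across more jobs, here is a minimal sketch that pulls the node hostname and any unreachable Ansible hosts out of a job-output log. The function name and regular expressions are my own assumptions based on the two log lines quoted above, not an existing tool, and the "limestone" substring check assumes the provider name is always embedded in the hostname.

```python
import re

# Patterns inferred from the log lines quoted in this bug (assumptions).
HOSTNAME_RE = re.compile(r"Hostname: (\S+)")
UNREACHABLE_RE = re.compile(r"fatal: \[([^\]]+)\]: UNREACHABLE!")


def scan_job_output(lines):
    """Return (first hostname seen, list of hosts reported UNREACHABLE)."""
    hostname = None
    unreachable = []
    for line in lines:
        match = HOSTNAME_RE.search(line)
        if match and hostname is None:
            hostname = match.group(1)
        match = UNREACHABLE_RE.search(line)
        if match:
            unreachable.append(match.group(1))
    return hostname, unreachable


# The two lines from the latest failure quoted above.
SAMPLE = [
    "2019-01-02 15:57:12.946423 | localhost | Hostname: centos-7-limestone-regionone-0001456752",
    "2019-01-02 16:03:11.725793 | primary | fatal: [subnode-2]: UNREACHABLE! => {",
]

hostname, unreachable = scan_job_output(SAMPLE)
on_limestone = "limestone" in hostname
print(hostname, unreachable, on_limestone)
```

Running this over the job-output.txt of each failed job would show whether every failure lands on a limestone host.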

Ronelle Landy (rlandy) wrote :

Looking into whether we can exclude certain providers for multinode jobs.

David Moreau Simard (dmsimard) wrote :

This can be identified in logstash with this query:
message:"fatal: [subnode-2]: UNREACHABLE!"

See the link below (increase the duration to 7 days at the top right):
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22fatal%3A%20%5Bsubnode-2%5D%3A%20UNREACHABLE!%5C%22
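In case it helps with building similar links for other error strings, the query part of that URL is just the percent-encoded form of the query with escaped quotes. A small sketch, assuming the dashboard accepts the same `query=` parameter for any field and phrase (the helper name `logstash_url` is hypothetical):

```python
from urllib.parse import quote

DASHBOARD = "http://logstash.openstack.org/#dashboard/file/logstash.json"


def logstash_url(field, phrase):
    """Build a dashboard link for an exact-phrase match on one field."""
    # The phrase is wrapped in backslash-escaped double quotes, as in the
    # link above; "!" is kept unencoded to match that link.
    query = f'{field}:\\"{phrase}\\"'
    return f"{DASHBOARD}?query={quote(query, safe='!')}"


url = logstash_url("message", "fatal: [subnode-2]: UNREACHABLE!")
print(url)
```

For the query above this reproduces the link I pasted.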

David Moreau Simard (dmsimard) wrote :

See screenshot for posterity.

wes hayutin (weshayutin) wrote :

2019-01-02 16:02:56.441067 | primary | PLAY [Create configs on subnodes] **********************************************
2019-01-02 16:02:56.509946 | primary |
2019-01-02 16:02:56.510092 | primary | TASK [Gathering Facts] *********************************************************
2019-01-02 16:02:56.551647 | primary | Wednesday 02 January 2019 16:02:56 +0000 (0:00:00.181) 0:00:13.974 *****
2019-01-02 16:03:11.725793 | primary | fatal: [subnode-2]: UNREACHABLE! => {
2019-01-02 16:03:11.725914 | primary | "changed": false,
2019-01-02 16:03:11.725968 | primary | "unreachable": true
2019-01-02 16:03:11.725998 | primary | }
2019-01-02 16:03:11.726026 | primary |
2019-01-02 16:03:11.726056 | primary | MSG:
2019-01-02 16:03:11.726083 | primary |
2019-01-02 16:03:11.726204 | primary | SSH Error: data could not be sent to remote host "10.4.70.74". Make sure this host can be reached over ssh

http://logs.openstack.org/64/625164/1/gate/tripleo-ci-centos-7-containers-multinode/f5a9c7d/job-output.txt.gz#_2019-01-02_16_02_56_441067

gate failure :(

David Moreau Simard (dmsimard) wrote :

It's worth mentioning that TripleO jobs do not have a 100% failure rate on limestone.
In fact, I found a pattern when comparing failed jobs to successful ones.

Limestone exposes two kinds of CPUs:
- Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
- Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)

From the several failures I've looked at, they all used the E3-12xx variant. I was not able to find failures that used the E5-2650 CPU. This could be a coincidence, though.

We've discussed this in #openstack-infra [1] and Logan from Limestone was able to identify that some compute nodes had nova configured to use "host-model" as the CPU mode, which ended up being exposed to guests as the E3-12xx CPU.
Compute nodes that were correctly configured to use "host-passthrough" exposed the E5-2650 CPU.

It's possible that this may have caused issues if TripleO attempted to set up nested virtualization [2].
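To make this easier to check on a given node, here is a rough sketch that reads /proc/cpuinfo-style text and reports the CPU model plus whether the hardware virtualization flags (vmx for Intel, svm for AMD) are exposed. The function name is hypothetical, and the flags line in the sample is illustrative only; I have not confirmed which flags the failing nodes actually exposed, nor that a missing flag is what broke these jobs.

```python
def summarize_cpuinfo(cpuinfo_text):
    """Return (model name, True if vmx or svm is among the CPU flags)."""
    model = None
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model, bool(flags & {"vmx", "svm"})


# Illustrative input: the model name is the one seen on failing jobs,
# the flags line is made up for the example.
SAMPLE = (
    "model name\t: Intel Xeon E3-12xx v2 (Ivy Bridge, IBRS)\n"
    "flags\t\t: fpu vme de pse\n"
)

model, has_virt = summarize_cpuinfo(SAMPLE)
print(model, has_virt)
```

On a real node the input would be the contents of /proc/cpuinfo, for example `summarize_cpuinfo(open("/proc/cpuinfo").read())`.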

Every compute node in Limestone should now properly use "host-passthrough" and I've sent a review to add an elastic-recheck query for this bug: https://review.openstack.org/#/c/628034/

[1]: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2019-01-02.log.html#t2019-01-02T19:50:28
[2]: https://github.com/openstack/tripleo-quickstart-extras/blob/49f22ec31af603010c0cbe2cefd86cbc751768de/playbooks/multinode-undercloud.yml#L88

Ronelle Landy (rlandy) wrote :

No occurrences of the error message "fatal: [subnode-2]: UNREACHABLE!" since 11pm on 01/02.
Will continue to watch this.

David Moreau Simard (dmsimard) wrote :

This is now tracked in elastic-recheck: http://status.openstack.org/elastic-recheck/#1810054

summary: - mulitnode jobs failing on gathering facts from subnode-2
+ multinode jobs failing on gathering facts from subnode-2