Comment 7 for bug 1762419

Matt Young (halcyondude) wrote :

The other issue is marked as a duplicate of this one, yet it contains a writeup of what's failing that someone outside our team can digest, so I'm including it here.

(from https://bugs.launchpad.net/tripleo/+bug/1770684)

===

The core issue here is that our jobs are not capturing the logs needed to diagnose failures in the upload job. Specifically, the latter portion of the job, which creates the final images using libguestfs / virt-customize, does write logs, but these are neither emitted in the job console log nor collected by the CI jobs. In this particular case, not being able to quickly diagnose the issue resulted in multiple days of gate outages for the tripleo project, as well as for any other projects (openstack wide) that have opted to include tripleo gates.

As the path through the various CI scripts/playbooks/roles is not obvious, what follows is a walkthru leading up to the observed failure. The final root cause diagnosis for this issue was made possible by RDO team members who were able to access the nodes running in CI, pull one out of rotation, and diagnose on-box, something that would not have been required if we had collected the correct artifacts.

---

The failure:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/cea1811/console.txt.gz#_2018-05-10_21_18_35_446

fatal: [localhost]: FAILED! => {"changed": true, "cmd": "virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2 > .//.__repo_setup.sh.log 2>&1"

virt-customize step failed here:

https://github.com/openstack/tripleo-quickstart-extras/blob/af40c1e847e044b159d76a4cc7ea87eb6d7c570c/roles/modify-image/tasks/libguestfs.yml#L51

--- walkthru to point of failure ---

job:

* periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload

This job was running the following CI script:

* https://github.com/rdo-infra/ci-config/blob/master/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh

The playbook that's failing (convert-overcloud-undercloud.yml) is created by the script dynamically via a bash "here document"

* http://tldp.org/LDP/abs/html/here-docs.html
* https://github.com/rdo-infra/ci-config/blob/2ea0c014fd35d2080d0594408c62fefe262f15f6/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh#L18
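For readers unfamiliar with the pattern, here is a minimal sketch of how a shell script can emit a playbook from a here document. It is not the contents of convert-upload-undercloud.sh; the file name, play name, and temp directory are assumptions made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the path and the play below are hypothetical, not taken from
# the real CI script.
playbook_dir="$(mktemp -d)"
playbook="$playbook_dir/example-playbook.yml"

# Quoting the delimiter ('EOF') stops bash from expanding $variables inside
# the body, so the YAML is written out exactly as typed.
cat > "$playbook" <<'EOF'
---
- name: Example play generated from a shell script
  hosts: localhost
  roles:
    - repo-setup
EOF

echo "wrote $playbook"
```

Because the playbook only exists once the script has run, it will not show up when grepping the repos for its name, which is part of why the path to the failure is hard to follow.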

The emitted playbook includes the 'repo-setup' role

* https://github.com/openstack/tripleo-quickstart/tree/master/roles/repo-setup

The final step of this role uses the 'modify-image' role

* https://github.com/openstack/tripleo-quickstart/blob/master/roles/repo-setup/tasks/inject_repos_into_image.yml

```
- name: Inject the repositories into the image
  include_role:
    name: modify-image
  vars:
    image_to_modify: "{{ repo_inject_image_path }}"
    modify_script: "{{ repo_setup_dir }}/{{ repo_setup_script }}"
```

The modify-image role then runs a virt-customize operation

* https://github.com/openstack/tripleo-quickstart-extras/tree/master/roles/modify-image

This brings us to the actual failure:

* https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/modify-image/tasks/libguestfs.yml#L51

TASK [modify-image : Run virt-customize on the provided image] *****************
  task path: /home/jenkins/workspace/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/.quickstart/usr/local/share/ansible/roles/modify-image/tasks/libguestfs.yml:51
  Thursday 10 May 2018 21:18:34 +0000 (0:00:00.055) 0:00:21.482 **********
  fatal: [localhost]: FAILED! => {"changed": true, "cmd": "virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2 > .//.__repo_setup.sh.log 2>&1", "delta": "0:00:00.981176", "end": "2018-05-10 21:18:35.430636", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-05-10 21:18:34.449460", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

--- end walkthru ---

ROOT CAUSE of the specific issue (note: not the point of this LP)

* fixed by https://review.rdoproject.org/r/#/c/13737/2/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh

```
Set kernel override to centos 7.5 kernel

Default value for libguestfs_kernel_override is 3.10.0-693.el7.x86_64, which do not work with centos 7.5, setting it to centos 7.5 kernel: 3.10.0-862.2.3.el7.x86_64.
amoralej, ykarel have been working on 7.5 kernel and libguestfs
```

* https://review.rdoproject.org/r/#/q/topic:bug/1743749
* https://bugs.launchpad.net/tripleo/+bug/1743749
* root cause for actual kernel bug is https://bugzilla.redhat.com/show_bug.cgi?id=1535973

--- end root cause ---

Potential fixes for this issue:

- update collect-logs role to capture the modify-image virt-customize logs
- update modify-image role to pipe the virt-customize output through tee (vs. a simple '>' redirect), so it lands in the console log as well as in the log file

IMHO we should do both.
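
A rough sketch of the tee approach, assuming bash is the shell in use. `run_and_log` is a hypothetical helper (not part of modify-image), and the `sh -c` command is a stand-in for the real virt-customize call. The point is that the output reaches both the console and the log file, while `pipefail` keeps the failing return code from being masked by tee's success.

```shell
#!/usr/bin/env bash
# Sketch only: run_and_log is a hypothetical helper, not part of modify-image.
# Without pipefail the pipeline would report tee's exit status (0) and hide
# the failure of the command being logged.
set -o pipefail

run_and_log() {
    local log_file="$1"
    shift
    # Merge stderr into stdout, then write to both the console and the log.
    "$@" 2>&1 | tee "$log_file"
}

log_dir="$(mktemp -d)"

# Stand-in for the real virt-customize invocation: prints a line and fails.
run_and_log "$log_dir/example.log" sh -c 'echo "simulated failure output"; exit 1' \
    || echo "command failed with exit status $?"
```

With a plain '>' redirect, as in the failing task above, the same run leaves "stdout" and "stderr" empty in the Ansible result, and the only copy of the diagnostics is a dotfile on a node we normally can't reach.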