Since the other issue is marked as a duplicate of this one, yet contains a writeup of what's failing that someone (not on our team) can digest, I'm including it here.
(from https://bugs.launchpad.net/tripleo/+bug/1770684)
===
The core issue here is that our jobs are not capturing the logs needed to diagnose failures in the upload job. Specifically, the latter portion of the job, which creates the final images using libguestfs / virt-customize, does write logs, but these are neither emitted in the job console log nor collected by the CI jobs. In this particular case, not being able to quickly diagnose the issue resulted in multiple days of gate outages for the tripleo project, as well as for any other projects (OpenStack-wide) that have opted to include tripleo gates.
As the path through the various CI scripts/playbooks/roles is not obvious, what follows is a walkthru leading up to the observed failure. The final root-cause diagnosis was only possible because RDO team members were able to access the nodes running in CI, pull one out of rotation and diagnose on-box... something that would not have been required if we had collected the correct artifacts.
---
The failure:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/cea1811/console.txt.gz#_2018-05-10_21_18_35_446
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2 > .//.__repo_setup.sh.log 2>&1"
virt-customize step failed here:
https://github.com/openstack/tripleo-quickstart-extras/blob/af40c1e847e044b159d76a4cc7ea87eb6d7c570c/roles/modify-image/tasks/libguestfs.yml#L51
--- walkthru to point of failure ---
job:
* periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload
This job was running the following CI script:
* https://github.com/rdo-infra/ci-config/blob/master/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh
The playbook that's failing (convert-overcloud-undercloud.yml) is created by the script dynamically via a bash "here document"
* http://tldp.org/LDP/abs/html/here-docs.html
* https://github.com/rdo-infra/ci-config/blob/2ea0c014fd35d2080d0594408c62fefe262f15f6/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh#L18
The emitted playbook includes the 'repo-setup' role
* https://github.com/openstack/tripleo-quickstart/tree/master/roles/repo-setup
The final step of this role uses the 'modify-image' role
* https://github.com/openstack/tripleo-quickstart/blob/master/roles/repo-setup/tasks/inject_repos_into_image.yml
```yaml
- name: Inject the repositories into the image
  include_role:
    name: modify-image
  vars:
    image_to_modify: "{{ repo_inject_image_path }}"
    modify_script: "{{ repo_setup_dir }}/{{ repo_setup_script }}"
```
The modify-image role then runs a virt-customize operation
* https://github.com/openstack/tripleo-quickstart-extras/tree/master/roles/modify-image
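Going by the failing command in the console log, `image_to_modify` and `modify_script` end up in a call of roughly the following shape. This is only a sketch that prints the command instead of running it. The log-file naming (`.__<script name>.log`) is an assumption inferred from that single log line, not read from the role's source.

```shell
# Values as they appear in the failing command in the console log
# (the log shows './/' where the working dir is joined; normalized to './' here).
image_to_modify="overcloud-full.qcow2"
modify_script="./repo_setup.sh"

# Assumption: the log name is derived from the script name, as the one
# observed command suggests (.__repo_setup.sh.log).
log_file="./.__$(basename "$modify_script").log"

# virt-customize flags seen in the log:
#   -v            verbose messages
#   -x            trace libguestfs API calls
#   --run SCRIPT  run SCRIPT inside the guest image
#   -a IMAGE      the disk image to operate on
# Printed rather than executed, so the sketch runs without libguestfs.
cmd="virt-customize -v -x --run $modify_script -a $image_to_modify > $log_file 2>&1"
echo "$cmd"
```

Note that everything virt-customize prints goes into that dot-prefixed file in the working directory, not to the Ansible task output.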
This brings us to the actual failure:
* https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/modify-image/tasks/libguestfs.yml#L51
TASK [modify-image : Run virt-customize on the provided image] *****************
task path: /home/jenkins/workspace/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/.quickstart/usr/local/share/ansible/roles/modify-image/tasks/libguestfs.yml:51
Thursday 10 May 2018 21:18:34 +0000 (0:00:00.055) 0:00:21.482 **********
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2 > .//.__repo_setup.sh.log 2>&1", "delta": "0:00:00.981176", "end": "2018-05-10 21:18:35.430636", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2018-05-10 21:18:34.449460", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
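The empty "stdout" and "stderr" in the result above are what you would expect from the `> ... 2>&1` redirect: all diagnostics live only in the dot-prefixed log file. One thing to watch for when collecting it is that a plain shell `*.log` glob does not match names starting with a dot. Below is a minimal sketch of gathering such files. `collect_hidden_logs` and the directory names are hypothetical and are not part of the collect-logs role.

```shell
# Hypothetical helper: copy dot-prefixed logs (e.g. .__repo_setup.sh.log)
# from a working directory into an artifacts directory.
collect_hidden_logs() {
    src_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    # The pattern names the leading dot explicitly, so hidden logs are matched.
    find "$src_dir" -maxdepth 1 -type f -name '.*.log' \
        -exec cp {} "$dest_dir/" \;
}

# Demo with a simulated log so the sketch runs without libguestfs.
work_dir="$(mktemp -d)"
artifacts_dir="$(mktemp -d)"
echo "simulated virt-customize output" > "$work_dir/.__repo_setup.sh.log"

collect_hidden_logs "$work_dir" "$artifacts_dir"
ls -A "$artifacts_dir"   # -A lists dot files too
```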
--- end walkthru ---
ROOT CAUSE of the specific issue (note: not the point of this LP)
* fixed by https://review.rdoproject.org/r/#/c/13737/2/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh
```
Set kernel override to centos 7.5 kernel
Default value for libguestfs_kernel_override is 3.10.0-693.el7.x86_64, which do not work with centos 7.5, setting it to centos 7.5 kernel: 3.10.0-862.2.3.el7.x86_64.
amoralej, ykarel have been working on 7.5 kernel and libguestfs
```
* https://review.rdoproject.org/r/#/q/topic:bug/1743749
* https://bugs.launchpad.net/tripleo/+bug/1743749
* root cause for actual kernel bug is https://bugzilla.redhat.com/show_bug.cgi?id=1535973
Potential fixes for this issue:
- update collect-logs role to capture the modify-image virt-customize logs
- update modify-image role to tee the virt-customize output (vs. a simple '>' redirect), so it also shows up in the job console log
IMHO we should do both.
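A minimal sketch of the tee option, assuming a POSIX `/bin/sh` (the default for Ansible's shell module). One wrinkle is that a plain `cmd | tee log` reports tee's exit status, so a failing virt-customize would no longer fail the task. The sketch passes the real status through a temp file. `run_and_tee` and `fake_customize` are hypothetical names, and the stand-in replaces the real virt-customize call so this runs anywhere.

```shell
# Hypothetical helper: run a command, send its combined stdout/stderr to
# both the console and a log file, and return the command's own exit code.
run_and_tee() {
    _log_file="$1"; shift
    _rc_file="$(mktemp)"
    # Record the command's status before tee can mask it.
    { "$@" 2>&1; echo "$?" > "$_rc_file"; } | tee "$_log_file"
    _rc="$(cat "$_rc_file")"
    rm -f "$_rc_file"
    return "$_rc"
}

# Stand-in for the real call, which in the failing job was:
#   virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2
fake_customize() {
    echo "libguestfs: simulated error output"
    return 1
}

log_file="$(mktemp)"
rc=0
run_and_tee "$log_file" fake_customize || rc=$?
echo "exit code: $rc"
```

With this shape the error text lands in the console log and in the file, and the task still fails with the original return code.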