Collect logs task is not collecting the virt-customize logs from periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-$release-upload

Bug #1762419 reported by Arx Cruz on 2018-04-09
This bug affects 2 people
Affects: tripleo
Importance: High
Assigned to: Matt Young

Bug Description

We noticed a failure today in the virt-customize role, and we weren't able to check the logs because log collection is done in the toc-test builder, while virt-customize runs in the images-upload-and-label builder.

Changed in tripleo:
assignee: Rafael Folco (rafaelfolco) → nobody
Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Matt Young (halcyondude) on 2018-04-13
Changed in tripleo:
importance: High → Critical
Gabriele Cerami (gcerami) wrote :

Either we relaunch collect logs in image-upload builder or we do a basic copy of any output from images-upload builder to the job logs dir.
Investigating the second option first.

Gabriele Cerami (gcerami) wrote :

One specific problem is that the images-upload builder calls repo_setup, which in turn calls modify-image, which creates a .__repo_setup.sh.log file that is not uploaded to the logs.
To copy these individual files without using the collect-logs role, we first need to know where each of these roles writes its output.
To copy them using the collect-logs role, which already knows the locations, we would have to deactivate everything else that collect-logs does and hope the parts we can't deactivate are idempotent.
The quickest way to fix it is to just copy every file from a hardcoded path.
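
A minimal sketch of that "copy from a hardcoded path" option follows. WORK_DIR, LOGS_DIR and the copy_builder_logs helper are hypothetical names used for illustration, not the paths or code the job actually uses. It relies on find rather than a plain "*.log" shell glob, because a glob would skip hidden files such as .__repo_setup.sh.log.

```shell
#!/usr/bin/env bash
# Sketch only: WORK_DIR and LOGS_DIR are hypothetical, not the real job paths.

copy_builder_logs() {
    local work_dir="$1" logs_dir="$2"
    mkdir -p "$logs_dir"
    # find's -name pattern also matches dotfiles like .__repo_setup.sh.log,
    # which an unquoted *.log shell glob would not expand to by default.
    find "$work_dir" -maxdepth 1 -type f -name '*.log' \
        -exec cp -- {} "$logs_dir"/ \;
}

# Demo with throwaway directories and a stand-in log file.
WORK_DIR="$(mktemp -d)"
LOGS_DIR="$(mktemp -d)/logs"
echo "virt-customize output" > "$WORK_DIR/.__repo_setup.sh.log"
copy_builder_logs "$WORK_DIR" "$LOGS_DIR"
ls -A "$LOGS_DIR"
```

The trade-off is the one noted above: the path is hardcoded, so any role that starts writing logs elsewhere will be missed until the path list is updated.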

Gabriele Cerami (gcerami) wrote :
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
assignee: Gabriele Cerami (gcerami) → Quique Llorente (quiquell)
Changed in tripleo:
status: Triaged → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released
yatin (yatinkarel) wrote :

Still not working (the copy command is not even executed) when the script above the copy command fails. The reason is probably that -x is used.
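
One possible way to make the copy run even when an earlier step fails is to register it as an EXIT trap before the step that may fail. The sketch below assumes the script aborts early because errexit (-e) is in effect alongside -x; the comment above only names -x, so that part is an assumption. The job script, the collect_logs helper and all paths are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch only: job.sh, collect_logs and the directories are hypothetical.
WORK_DIR="$(mktemp -d)"
LOGS_DIR="$(mktemp -d)"
export WORK_DIR LOGS_DIR

JOB_SCRIPT="$WORK_DIR/job.sh"
cat > "$JOB_SCRIPT" <<'EOF'
set -ex   # assumption: -e is what aborts the script; -x only traces commands

collect_logs() {
    find "$WORK_DIR" -maxdepth 1 -type f -name '*.log' \
        -exec cp -- {} "$LOGS_DIR"/ \;
}
# Registered before anything can fail, so it runs on every exit path.
trap collect_logs EXIT

echo "virt-customize output" > "$WORK_DIR/.__repo_setup.sh.log"
false                                        # stand-in for the failing step
echo "not reached" > "$WORK_DIR/after.log"   # skipped because of -e
EOF

status=0
bash "$JOB_SCRIPT" || status=$?
echo "job exit status: $status"
ls -A "$LOGS_DIR"
```

The trap does not call exit itself, so the job still reports the failure of the original step while the logs are copied on the way out.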

wes hayutin (weshayutin) on 2018-05-11
Changed in tripleo:
assignee: Quique Llorente (quiquell) → nobody
Matt Young (halcyondude) wrote :

Since the other issue is marked as a duplicate of this one, yet contains a writeup of what's failing that someone (not on our team) can digest, I'm including it here.

(from https://bugs.launchpad.net/tripleo/+bug/1770684)

===

The core issue here is that our jobs are not capturing the logs needed to diagnose failures in the upload job. Specifically, the latter portion of the job, which creates the final images using libguestfs / virt-customize, produces logs, but these are neither emitted in the job console log nor collected by the CI jobs. In this particular case, not being able to quickly diagnose the issue resulted in multiple days of gate outages for the tripleo project, as well as for any other projects (openstack wide) that have opted to include tripleo gates.

As the path through the various CI scripts/playbooks/roles is not obvious, what follows is a walkthrough leading up to the failure observed. The final root cause diagnosis for this issue was enabled by RDO team members who were able to access the nodes running in CI, pull one out of rotation and diagnose on-box, something that would not have been required if we had collected the correct artifacts.

---

The failure:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/cea1811/console.txt.gz#_2018-05-10_21_18_35_446

fatal: [localhost]: FAILED! => {"changed": true, "cmd": "virt-customize -v -x --run .//repo_setup.sh -a overcloud-full.qcow2 > .//.__repo_setup.sh.log 2>&1"

virt-customize step failed here:

https://github.com/openstack/tripleo-quickstart-extras/blob/af40c1e847e044b159d76a4cc7ea87eb6d7c570c/roles/modify-image/tasks/libguestfs.yml#L51
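
Because the command above redirects all output into .__repo_setup.sh.log, nothing useful reaches the console when it fails. A hedged sketch of one way to surface it: wrap the command so that, on failure, the tail of the log is echoed to stderr. The run_logged helper is hypothetical and is not part of the modify-image role; a stand-in command replaces virt-customize so the example runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch only: run_logged is a hypothetical helper, not part of modify-image.

run_logged() {
    local log_file="$1"; shift
    local rc=0
    # Same redirection as the failing task: stdout and stderr go to the log.
    "$@" > "$log_file" 2>&1 || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "command failed (rc=$rc); last lines of $log_file:" >&2
        tail -n 50 "$log_file" >&2
    fi
    return "$rc"
}

# Demo: a stand-in for the virt-customize invocation that fails with rc 3.
LOG_FILE="$(mktemp -d)/.__repo_setup.sh.log"
rc=0
run_logged "$LOG_FILE" sh -c 'echo "stand-in failure output"; exit 3' || rc=$?
echo "rc=$rc"
```

This keeps the full log on disk for later collection and only spends console space when something actually goes wrong.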

--- walkthrough to point of failure ---

job:

* periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload

This job was running the following CI script:

* https://github.com/rdo-infra/ci-config/blob/master/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh

The playbook that's failing (convert-overcloud-undercloud.yml) is created dynamically by the script via a bash "here document":

* http://tldp.org/LDP/abs/html/here-docs.html
* https://github.com/rdo-infra/ci-config/blob/2ea0c014fd35d2080d0594408c62fefe262f15f6/ci-scripts/tripleo-upstream/convert-upload-undercloud.sh#L18

The emitted playbook includes the 'repo-setup' role

* https://github.com/openstack/tripleo-quickstart/tree/master/roles/repo-setup

The final step of this role uses the 'modify-image' role

* https://github.com/openstack/tripleo-quickstart/blob/master/roles/repo-setup/tasks/inject_repos_into_image.yml

```
- name: Inject the repositories into the image
  include_role:
    name: modify-image
  vars:
    image_to_modify: "{{ repo_inject_image_path }}"
    modify_script: "{{ repo_setup_dir }}/{{ repo_setup_script }}"
```

The modify-image role then runs a virt-customize operation

* https://github.com/openstack/tripleo-quickstart-extras/tree/master/roles/modify-image

This brings us to the actual failure:

* https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/modify-image/tasks/libguestfs.yml#L51

TASK [modify-image : Run virt-customize on the provided image] *************...


Changed in tripleo:
status: Fix Released → Triaged
Matt Young (halcyondude) wrote :

We also don't appear to capture the builder log from this step:

https://github.com/openstack/tripleo-quickstart/blob/master/roles/convert-image/tasks/main.yml#L50

```
- name: collect diagnostic log from undercloud image
  shell: >
    virt-cat -a undercloud.qcow2 /tmp/builder.log > builder-undercloud.log 2>&1;
  changed_when: true
  args:
    chdir: "{{ convert_image_working_dir }}"
  environment:
    LIBGUESTFS_BACKEND: direct
    LIBVIRT_DEFAULT_URI: "{{ libvirt_uri }}"
```

Example:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/dac32bd

The output from the VC command is spammed to the console log, and with some work can be made usable. Ideally, part of fixing this issue is capturing builder-undercloud.log, as this is the canonical log of all changes to the undercloud.

Changed in tripleo:
assignee: nobody → Matt Young (halcyondude)
importance: Critical → High
Matt Young (halcyondude) wrote :

Moved this to "high" as no jobs are failing at the moment.

Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Fix Released