tripleo-ci-centos-8-standalone-tv-validation failing on tripleo-validations stable/train due to log path mismatch

Bug #1933515 reported by Jiri Podivin
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: mathieu bultel

Bug Description

The job is failing because validations-common, validations-libs and tripleo-validations determine the log path differently.

Trace:
------

2021-06-25 17:22:50.053216 | primary | TASK [Run validations] *********************************************************
2021-06-25 17:22:50.053306 | primary | Friday 25 June 2021 17:22:50 +0000 (0:00:00.211) 0:42:42.243 ***********
2021-06-25 17:22:55.116995 | primary | fatal: [undercloud]: FAILED! => {
2021-06-25 17:22:55.117090 | primary | "changed": true,
2021-06-25 17:22:55.117121 | primary | "cmd": "validation run --validation check-cpu --validation-dir /usr/share/ansible/validation-playbooks --inventory tripleo-ansible-inventory.yaml --output-log validation_check-cpu.log --extra-vars minimal_cpu_count=2 --extra-env-vars ANSIBLE_STDOUT_CALLBACK=default",
2021-06-25 17:22:55.117145 | primary | "delta": "0:00:04.550239",
2021-06-25 17:22:55.117167 | primary | "end": "2021-06-25 17:22:55.086073",
2021-06-25 17:22:55.117205 | primary | "rc": 1,
2021-06-25 17:22:55.117229 | primary | "start": "2021-06-25 17:22:50.535834"
2021-06-25 17:22:55.117250 | primary | }
2021-06-25 17:22:55.117271 | primary |
2021-06-25 17:22:55.117294 | primary | STDOUT:
2021-06-25 17:22:55.117343 | primary |
2021-06-25 17:22:55.117368 | primary | [WARNING]: Skipping key (deprecated) in group (overcloud) as it is not a
2021-06-25 17:22:55.117388 | primary | mapping, it is a <class 'ansible.parsing.yaml.objects.AnsibleUnicode'>
2021-06-25 17:22:55.117409 | primary | [WARNING]: Found both group and host with same name: standalone
2021-06-25 17:22:55.117430 | primary |
2021-06-25 17:22:55.117451 | primary | PLAY [localhost] ***************************************************************
2021-06-25 17:22:55.117472 | primary |
2021-06-25 17:22:55.117493 | primary | TASK [check_cpu : Gather facts] ************************************************
2021-06-25 17:22:55.117529 | primary | Friday 25 June 2021 17:22:53 +0000 (0:00:00.066) 0:00:00.066 ***********
2021-06-25 17:22:55.117553 | primary | ok: [localhost]
2021-06-25 17:22:55.117574 | primary |
2021-06-25 17:22:55.117595 | primary | TASK [check_cpu : Verify the number of CPU cores] ******************************
2021-06-25 17:22:55.117619 | primary | Friday 25 June 2021 17:22:54 +0000 (0:00:01.260) 0:00:01.326 ***********
2021-06-25 17:22:55.117640 | primary | ok: [localhost]
2021-06-25 17:22:55.117660 | primary |
2021-06-25 17:22:55.117680 | primary | PLAY RECAP *********************************************************************
2021-06-25 17:22:55.117706 | primary | localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
2021-06-25 17:22:55.117728 | primary | Friday 25 June 2021 17:22:54 +0000 (0:00:00.172) 0:00:01.499 ***********
2021-06-25 17:22:55.117749 | primary | ===============================================================================
2021-06-25 17:22:55.117769 | primary | check_cpu : Gather facts ------------------------------------------------ 1.26s
2021-06-25 17:22:55.117790 | primary | check_cpu : Verify the number of CPU cores ------------------------------ 0.17s
2021-06-25 17:22:55.117811 | primary |
2021-06-25 17:22:55.117831 | primary |
2021-06-25 17:22:55.117851 | primary | STDERR:
2021-06-25 17:22:55.117871 | primary |
2021-06-25 17:22:55.117892 | primary | No validation has been run, please check log in the Ansible working directory.
2021-06-25 17:22:55.117912 | primary |
2021-06-25 17:22:55.117932 | primary |
2021-06-25 17:22:55.117952 | primary | MSG:
2021-06-25 17:22:55.117973 | primary |
2021-06-25 17:22:55.117993 | primary | non-zero return code

Examples of failed jobs:
------------------------

https://zuul.opendev.org/t/openstack/build/8a7327fccfa245a9866bb177c7b9838c

https://zuul.opendev.org/t/openstack/build/1784bdf0f8b0400aab9b0fcce1d2d0ab

The underlying issue has been known for a long time, and there have been several attempts to fix it.

In short, the validation framework (VF) determines which validations were run, and their results, by reading stored logs. Originally, a failed attempt to write the logs produced no notification, so the first problems became apparent only during subsequent retrieval.
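
The failure mode described above can be illustrated with a minimal sketch. It is not the VF code: the function names, the JSON layout and the directory names are all hypothetical, chosen only to show how a writer and a reader that disagree on the log directory lead to "no validation has been run" even though the playbook itself succeeded.

```python
import json
import tempfile
from pathlib import Path


def write_result(log_dir: Path, validation_id: str, status: str) -> Path:
    """Hypothetical writer: store one JSON result file per validation run."""
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / f"{validation_id}.json"
    log_file.write_text(json.dumps({"id": validation_id, "status": status}))
    return log_file


def read_results(log_dir: Path) -> list:
    """Hypothetical reader: reconstruct what ran purely from stored logs."""
    if not log_dir.is_dir():
        # Nothing on disk means, from the reader's point of view, nothing ran.
        return []
    return [json.loads(p.read_text()) for p in sorted(log_dir.glob("*.json"))]


def demo_mismatch() -> tuple:
    """Write to one directory, then read from that one and from another."""
    with tempfile.TemporaryDirectory() as tmp:
        writer_dir = Path(tmp) / "callback-logs"  # assumed writer location
        reader_dir = Path(tmp) / "cli-logs"       # assumed reader location
        write_result(writer_dir, "check-cpu", "PASSED")
        return read_results(writer_dir), read_results(reader_dir)


if __name__ == "__main__":
    matching, mismatched = demo_mismatch()
    print("same path:", matching)
    print("different path:", mismatched or "No validation has been run")
```

In this sketch the run itself succeeds in both cases; only the lookup differs, which matches the trace above where the play recap reports ok=2 and the command still exits with rc 1.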

Recent patches have largely resolved this, most notably https://review.opendev.org/c/openstack/validations-libs/+/795093

Unfortunately, there are still inconsistencies in behavior between validations-libs, validations-common and tripleo-validations which, as emerged during the discussion, proved irreconcilable.

Despite this, one patch attempting to unify behavior across VF was merged anyway and caused the issue described above.

Jiri Podivin (jpodivin)
Changed in tripleo:
assignee: nobody → mathieu bultel (mat-bultel)
wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → xena-rc1
tags: added: promotion-blocker
wes hayutin (weshayutin) wrote :

http://paste.openstack.org/show/806988/

https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-standalone-tv-validation&branch=stable%2Ftrain

2021-06-25 17:22:55.117851 | primary | STDERR:
2021-06-25 17:22:55.117871 | primary |
2021-06-25 17:22:55.117892 | primary | No validation has been run, please check log in the Ansible working directory.
2021-06-25 17:22:55.117912 | primary |
2021-06-25 17:22:55.117932 | primary |
2021-06-25 17:22:55.117952 | primary | MSG:
2021-06-25 17:22:55.117973 | primary |
2021-06-25 17:22:55.117993 | primary | non-zero return code

Jiri Podivin (jpodivin)
description: updated
Jiri Podivin (jpodivin) wrote :

The matter of the VF log directory is a long-standing issue, and it is, to a significant extent, my fault that it is so.

From the start we seriously misjudged how important, and how bug-prone, the VF logging behavior actually is, and how much the approach to logs differs between VF actions (run, list, show, etc.).

The attempts at a fix were numerous, and all only partly successful. The default and fallback locations of the log directory are defined in several different ways across tripleo repositories:

https://opendev.org/openstack/validations-common/src/branch/master/validations_common/callback_plugins/validation_json.py#L40

https://opendev.org/openstack/validations-libs/src/branch/master/validations_libs/utils.py#L35

https://opendev.org/openstack/python-tripleoclient/src/branch/master/tripleoclient/constants.py#L124
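
For illustration only, here is a sketch of what a single shared resolution rule for the default and fallback locations could look like. Everything in it is an assumption: the environment variable name, the function names and the precedence order (explicit argument, then environment, then default, then fallback) are hypothetical and do not reflect what any of the three files linked above actually do.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical variable name; the real projects define their own constants.
ENV_VAR = "EXAMPLE_VALIDATIONS_LOG_DIR"


def is_usable(path: Path) -> bool:
    """Return True if the directory exists or can be created, and is writable."""
    try:
        path.mkdir(parents=True, exist_ok=True)
    except OSError:
        return False
    return os.access(path, os.W_OK)


def resolve_log_dir(default: Path, fallback: Path,
                    explicit: Optional[Path] = None,
                    env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the first usable candidate: explicit, environment, default, fallback."""
    env = os.environ if env is None else env
    from_env = Path(env[ENV_VAR]) if ENV_VAR in env else None
    for candidate in (explicit, from_env, default, fallback):
        if candidate is not None and is_usable(candidate):
            return candidate
    # Fail loudly instead of letting the problem surface at read time.
    raise RuntimeError("no writable log directory found")
```

If both the component writing the logs and the component reading them called one such function with the same arguments, they could not end up with different directories, and an unwritable location would be reported at resolution time rather than during retrieval.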

Coordinating patches between the projects is complicated, especially with respect to package build order and stable branches.

As we have no spec for the logging behavior and no document describing our approach, the final behavior is subject to change even well into the review process, with an unfortunate impact on code quality.

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-validations (stable/train)

Change abandoned by "Gael Chamoulaud <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/797859
