Timed out CI jobs not collecting logs, "FAILED with status: 137"

Bug #1731456 reported by Jiří Stránský on 2017-11-10
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: wes hayutin
Jiří Stránský (jistr) wrote :

Capturing log output snippet here:

2017-11-10 06:45:56.626016 | primary | +(./toci_quickstart.sh:72): exit_value=143
2017-11-10 06:45:56.646550 | primary | +(./toci_quickstart.sh:74): [[ 143 == 0 ]]
2017-11-10 06:45:56.676353 | primary | +(./toci_quickstart.sh:74): echo 'Playbook run failed'
2017-11-10 06:45:56.677848 | primary | Playbook run failed
2017-11-10 06:45:56.692674 | primary | +(./toci_quickstart.sh:78): sed -i 's/hosts: all:!localhost/hosts: all:!localhost:!127.0.0.2/' /home/zuul/workspace/.quickstart/playbooks/collect-logs.yml
2017-11-10 06:45:56.898621 | primary | +(./toci_quickstart.sh:81): ./quickstart.sh --no-clone --working-dir /home/zuul/workspace/.quickstart --retain-inventory --teardown none --extra-vars tripleo_root=/opt/stack/new --extra-vars working_dir=/home/zuul --extra-vars 'validation_args='\''--validation-errors-nonfatal'\''' --release tripleo-ci/master --extra-vars @/opt/stack/new/tripleo-ci/toci-quickstart/config/collect-logs.yml --tags all --nodes config/nodes/1ctlr.yml --environment /opt/stack/new/tripleo-ci/toci-quickstart/config/testenv/multinode.yml --extra-vars @config/general_config/featureset-multinode-common.yml --config config/general_config/featureset017.yml --extra-vars deploy_timeout=80 --playbook collect-logs.yml --extra-vars artcl_collect_dir=/home/zuul/workspace/logs 127.0.0.2
2017-11-10 06:45:57.303974 | primary | +(./quickstart.sh:474): export ANSIBLE_CONFIG=/opt/stack/new/tripleo-quickstart/ansible.cfg
2017-11-10 06:45:57.304200 | primary | +(./quickstart.sh:474): ANSIBLE_CONFIG=/opt/stack/new/tripleo-quickstart/ansible.cfg
2017-11-10 06:45:57.304386 | primary | +(./quickstart.sh:475): export ANSIBLE_INVENTORY=/home/zuul/workspace/.quickstart/hosts
2017-11-10 06:45:57.304521 | primary | +(./quickstart.sh:475): ANSIBLE_INVENTORY=/home/zuul/workspace/.quickstart/hosts
2017-11-10 06:45:57.304924 | primary | +(./quickstart.sh:476): export ARA_DATABASE=sqlite:////home/zuul/workspace/.quickstart/ara.sqlite
2017-11-10 06:45:57.307520 | primary | +(./quickstart.sh:476): ARA_DATABASE=sqlite:////home/zuul/workspace/.quickstart/ara.sqlite
2017-11-10 06:45:57.307866 | primary | +(./quickstart.sh:479): source /opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh
2017-11-10 06:45:57.310461 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:1): export OPT_WORKDIR=/home/zuul/workspace/.quickstart
2017-11-10 06:45:57.310709 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:1): OPT_WORKDIR=/home/zuul/workspace/.quickstart
2017-11-10 06:45:57.310930 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:4): export SSH_CONFIG=/home/zuul/workspace/.quickstart/ssh.config.ansible
2017-11-10 06:45:57.311099 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:4): SSH_CONFIG=/home/zuul/workspace/.quickstart/ssh.config.ansible
2017-11-10 06:45:57.311501 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:6): touch /home/zuul/workspace/.quickstart/ssh.config.ansible
2017-11-10 06:45:57.395062 | primary | ++(/opt/stack/new/tripleo-quickstart/ansible_ssh_env.sh:7): export 'ANSIBLE_SSH_ARGS=-F /home/zuul/workspace/.quickstart/ssh.config.ansible'...
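
Note on the exit values: the 143 in the trace above and the 137 in the bug title both fit the usual shell convention that a status above 128 means the process was terminated by signal (status - 128), so 143 corresponds to SIGTERM (15) and 137 to SIGKILL (9). A minimal sketch for decoding such a status follows; the function name describe_exit_status is made up for illustration and is not part of tripleo-ci.

```shell
# Decode an exit status into the signal that caused it, if any.
# describe_exit_status is a hypothetical helper, not a tripleo-ci function.
describe_exit_status() {
    if [ "$1" -gt 128 ]; then
        # kill -l N prints the name of signal N without the SIG prefix.
        echo "killed by SIG$(kill -l "$(($1 - 128))")"
    elif [ "$1" -eq 0 ]; then
        echo "success"
    else
        echo "failed with exit code $1"
    fi
}

describe_exit_status 143   # the exit_value seen in toci_quickstart.sh
describe_exit_status 137   # the status quoted in the bug title
```

Read this way, the playbook run looks like it was asked to terminate (SIGTERM), while the job itself was later killed outright (SIGKILL).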

tags: added: alert
Jiří Stránský (jistr) wrote :

The log collection might be taking too long or getting stuck...

Alex Schultz (alex-schultz) wrote :

Since our log collection occurs during our testing phase, it can be killed if the overall process runs too long. In this case it appears that we were left with ~2 minutes or less to do all the log collection. This seems to be an overall issue with long deployment times rather than with log collection itself.

Jiří Stránský (jistr) wrote :

Right, the jobs would fail anyway, as they got killed by timeout. But even when this happened, we used to have a safe margin reserved for log collection -- we got logs even from timed-out jobs. This no longer seems to work well in the cases linked above. Maybe the safe margin we had somehow shrank by accident. The jobs reported a run length of 3 hr 4 min, while previously our jobs that timed out reported run lengths of around 2 hr 50 min.
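
To illustrate the "safe margin" idea, here is a rough sketch of running the main step under a budget that is shorter than the job timeout, so that log collection always gets the remainder. This is not how tripleo-ci implements it; the numbers, the function names, and the collect_logs placeholder are all assumptions made for the example. Only the coreutils timeout command is real.

```shell
# Hypothetical values: neither number is taken from the tripleo-ci config.
JOB_TIMEOUT_MIN=180   # total wall-clock budget for the CI job
LOG_RESERVE_MIN=20    # margin held back for log collection

# Placeholder standing in for the collect-logs playbook run.
collect_logs() {
    echo "collecting logs"
}

# Print the minutes the main run may use before yielding to log collection.
main_budget_min() {
    if [ "$2" -ge "$1" ]; then
        echo "reserve must be smaller than the job timeout" >&2
        return 1
    fi
    echo $(($1 - $2))
}

# Run the given command under the reduced budget, then always collect logs.
# timeout(1) sends SIGTERM at the limit and itself exits with status 124.
run_with_log_margin() {
    budget=$(main_budget_min "$JOB_TIMEOUT_MIN" "$LOG_RESERVE_MIN") || return 1
    timeout "${budget}m" "$@"
    rc=$?
    collect_logs
    return "$rc"
}
```

With these made-up numbers, something like `run_with_log_margin ./toci_quickstart.sh` would stop the main run after 160 minutes and leave 20 minutes for logs. If the margin shrinks to a couple of minutes, as seems to have happened here, collection gets cut off regardless of how fast it is.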

Alex Schultz (alex-schultz) wrote :

There was an instance of this where the log collection had 10 mins but still couldn't collect the logs in time. We should try to reduce the time it takes for log collection to run as much as possible.

wes hayutin (weshayutin) on 2017-11-13
tags: added: quickstart
wes hayutin (weshayutin) wrote :

Checking to see if this helps https://review.openstack.org/#/c/511526/

Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
wes hayutin (weshayutin) wrote :

FYI..
w/ patch 511526
legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet [1] took 2:48
w/o patch
legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet [2] took 3:46

w/ patch 511526
legacy-tripleo-ci-centos-7-ovb-ha-oooq [3] took 4:13
w/o patch
legacy-tripleo-ci-centos-7-ovb-ha-oooq [4] took 4:44

[1] http://logs.openstack.org/26/511526/6/check/legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet/c8d4899/logs/ara_oooq/
[2] http://logs.openstack.org/07/472607/123/check/legacy-tripleo-ci-centos-7-scenario002-multinode-oooq-puppet/ddc2beb/logs/ara_oooq/
[3] http://logs.openstack.org/26/511526/6/check-tripleo/legacy-tripleo-ci-centos-7-ovb-ha-oooq/c0c1978/logs/ara_oooq/
[4] http://logs.openstack.org/07/472607/123/check-tripleo/legacy-tripleo-ci-centos-7-ovb-ha-oooq/9fbd449/logs/ara_oooq/

tags: removed: alert
wes hayutin (weshayutin) wrote :

removing alert, noting that log collection is only taking a little over 4min max.
This is not an issue w/ log collection

wes hayutin (weshayutin) wrote :

https://review.openstack.org/#/c/511526/ has merged and is the agreed upon fix

Changed in tripleo:
assignee: Ronelle Landy (rlandy) → wes hayutin (weshayutin)
status: Triaged → Fix Released