hubbot check jobs are failing (timing out on OC deploy)

Bug #1771692 reported by Matt Young
Affects: tripleo
Status: Triaged
Importance: Critical
Assigned to: Unassigned

Bug Description

ROOT CAUSE IS NOT YET DETERMINED

---

The sanity (check) job for master has been failing with overcloud (OC) deploy timeouts:

https://review.openstack.org/#/c/567224

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/a7052df/job-output.txt.gz#_2018-05-16_17_41_26_574265

2018-05-16 17:41:26.574265 | primary | TASK [overcloud-deploy : Deploy the overcloud] *********************************
2018-05-16 17:41:26.606048 | primary | Wednesday 16 May 2018 17:41:26 +0000 (0:00:00.112) 1:01:59.851 *********
2018-05-16 19:07:18.664433 | primary | +(./toci_quickstart.sh:101): exit_value=143

---

The tripleo-ci-centos-7-3nodes-multinode job is failing the same way (job timeout during OC deploy):

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-3nodes-multinode/58ecfce/job-output.txt.gz#_2018-05-16_17_26_35_648672

2018-05-16 17:26:35.648672 | primary | TASK [overcloud-deploy : Deploy the overcloud] *********************************
2018-05-16 17:26:35.692581 | primary | Wednesday 16 May 2018 17:26:35 +0000 (0:00:00.121) 0:00:41.116 *********
2018-05-16 19:05:51.509779 | primary | +(./toci_quickstart.sh:101): exit_value=143
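
In both runs the wrapper script exits with 143. That value is 128 + 15, i.e. the process was killed by SIGTERM, which is consistent with the job being terminated at the job timeout rather than the deploy failing on its own. A minimal decoding sketch (the exit value comes from the logs above; the helper itself is just illustrative):

import signal

def describe_exit(code):
    # Shell convention: an exit status above 128 means the process
    # was terminated by signal (code - 128).
    if code > 128:
        sig = signal.Signals(code - 128)
        return "killed by %s (signal %d)" % (sig.name, sig.value)
    return "exited normally with status %d" % code

print(describe_exit(143))  # -> killed by SIGTERM (signal 15)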

---

The timeouts appear to have started with the recheck of the patch listed above (the hubbot tqe gate job) on May 15 at 3:05 PM:

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/369c645/job-output.txt.gz#_2018-05-16_01_26_29_930186

We've been heads-down chasing promotions, so we're only logging this now.

Tags: ci quickstart
Matt Young (halcyondude)
tags: added: alert

Matt Young (halcyondude) wrote:

Notes from the initial investigation follow (WIP).

-

# Disk space and RAM are both flagged by the validations as too low; it's unclear yet whether this is what normal looks like.

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/a7052df/job-output.txt.gz#_2018-05-16_17_28_22_140352

TASK [tripleo-validations : Display failed validations tests]

fatal: [undercloud]: FAILED! => {
    "changed": false,
    "failed": true,
    "msg": [
        "### undercloud-disk-space FAILED ###",
        "Task 'Verify root disk space' failed:",
        "Host: localhost",
        "Message: The available space on the root partition is 31.1 GB, but it should be at least 60 GB.",
        "",
        "Failure! The validation failed for all hosts:",
        "* localhost",
        "",
        "### undercloud-ram FAILED ###",
        "Task 'Verify the RAM requirements' failed:",
        "Host: localhost",
        "Message: The RAM on the undercloud node is 7977 MB, the minimal recommended value is 16384 MB.",
        "",
        "Failure! The validation failed for all hosts:",
        "* localhost"
    ]
}
2018-05-16 17:28:22.319412 | primary | ...ignoring
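
For context, this validation is essentially comparing free space on the root partition against a 60 GB threshold. A minimal sketch of an equivalent check (the threshold is the one reported in the log above; the real validation is an Ansible task in tripleo-validations, so this standalone helper is hypothetical):

import shutil

MIN_ROOT_GB = 60  # threshold reported by the undercloud-disk-space validation

def check_root_disk(path="/"):
    # Compare free space on the given partition against the minimum,
    # mirroring what the validation reports.
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < MIN_ROOT_GB:
        print("FAILED: %.1f GB free on %s, need at least %d GB" % (free_gb, path, MIN_ROOT_GB))
    else:
        print("OK: %.1f GB free on %s" % (free_gb, path))

check_root_disk()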

-

Additional undercloud (UC) validations are also failing:

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/a7052df/job-output.txt.gz#_2018-05-16_17_40_44_752465

{
    "changed": false,
    "failed": true,
    "msg": [
        "### ceilometerdb-size FAILED ###",
        "Task 'Check values' failed:",
        "Host: localhost",
        "Message: Value of metering_time_to_live is set to -1.",
        "",
        "Task 'Check values' failed:",
        "Host: localhost",
        "Message: Value of event_time_to_live is set to -1.",
        "",
        "Failure! The validation failed for all hosts:",
        "* localhost",
        "",
        "### deployment-images FAILED ###",
        "Task 'Fetch available images' failed:",
        "Host: localhost",
        "Message: Command `openstack image list --format value --column Name` exited with code: 1: non-zero return code",
        "",
        "stderr:",
        " (http://192.168.24.1:5000/v2.0/tokens): The resource could not be found. (HTTP 404) (Request-ID: req-e4ff4f49-5c9b-452f-9b9d-74ff05b1040b)",
        "",
        "Failure! The validation failed for all hosts:",
        "* localhost",
        "",
        "### switch-vlans FAILED ###",
        "Task 'Check that switch vlans are present if used in nic-config files' failed:",
        "Host: localhost",
        "Message: An unhandled exception occurred while running the lookup plugin 'introspection_data'. Error was a <class 'swiftclient.exceptions.ClientException'>, original message: Container GET failed: http://192.168.24.1:8080/v1/AUTH_ed317ea6680e42398bf3db09455d6b51/ironic-inspector?format=json 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<",
        "",
        "Failure! The validation failed for all hosts:",
        "* localhost",
        "",
        "### undercloud-debug FAILED ###",
        "Task 'Check the services for debug flag' failed:",
        "Host: localhost",
        "Message: The key 'debug' under the section 'DEFAULT' in file /etc/nova/nova.conf has the value: 'True'",
        "",
        "Task 'Check the services for debug flag' failed:"...

Matt Young (halcyondude) wrote:

Prior runs of this job that didn't time out (but also failed) show nearly double the free space (while still failing the validations). It doesn't make sense to me that these nodes would be sized so differently.

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/5ae0e04/job-output.txt.gz#_2018-05-14_09_26_18_812900

### undercloud-disk-space FAILED ###
Task 'Verify root disk space' failed:
Host: localhost
Message: The available space on the root partition is 51.1 GB, but it should be at least 60 GB.

-

The last actually passing run of this check job was on 11 May, and there again we've got 50+ GB free (and were still "failing" the validations):

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/382ec74/job-output.txt.gz#_2018-05-11_09_05_20_819048

Message: The available space on the root partition is 51.1 GB, but it should be at least 60 GB.
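
One quick way to compare the reported free space across runs is to pull the disk-space message straight out of each job-output log. A rough sketch (the URLs are the runs linked in this bug; the scraping approach is ad hoc, and depending on how the log server serves .gz files the explicit decompress may be unnecessary):

import gzip
import re
import urllib.request

RUNS = {
    "timed out (a7052df)": "http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/a7052df/job-output.txt.gz",
    "failed (5ae0e04)": "http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/5ae0e04/job-output.txt.gz",
    "passed (382ec74)": "http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/382ec74/job-output.txt.gz",
}
PATTERN = re.compile(r"available space on the root partition is ([\d.]+) GB")

for label, url in RUNS.items():
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    # Logs are stored gzip-compressed; fall back to raw text if the
    # server already decompressed them.
    try:
        text = gzip.decompress(data).decode("utf-8", "replace")
    except OSError:
        text = data.decode("utf-8", "replace")
    m = PATTERN.search(text)
    print(label, "->", m.group(1) + " GB" if m else "no disk-space message found")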

-

The other failing validation checks are "normal", it would seem, as this (passing) job goes on to ultimately succeed (validation failures notwithstanding):

http://logs.openstack.org/24/567224/1/check/tripleo-ci-centos-7-containers-multinode/382ec74/job-output.txt.gz#_2018-05-11_10_47_02_537614

summary: - hubbot check jobs are timing out on OC deploy
+ hubbot check jobs are failing (timing out on OC deploy)
wes hayutin (weshayutin)
tags: removed: alert