OVB jobs failing with RETRY_LIMIT while creating a stack, giving 504 Gateway Time-out (openresty/1.15.8.2)

Bug #1920101 reported by chandan kumar
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

On 19th Mar, most of the OVB jobs in the component/integration/check pipelines started failing with RETRY_LIMIT (Zuul's result when a job's setup fails repeatedly until the retry limit is exhausted): https://review.rdoproject.org/zuul/builds?result=RETRY_LIMIT

https://logserver.rdoproject.org/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-train/cc726bd/job-output.txt

```
2021-03-19 03:53:39.354585 | TASK [ovb-manage : Create a stack]
2021-03-19 03:54:42.266040 | primary | Traceback (most recent call last):
2021-03-19 03:54:42.266630 | primary | File "/home/zuul/workspace/ovb/openstack-virtual-baremetal/bin/deploy.py", line 352, in <module>
2021-03-19 03:54:42.266662 | primary | _deploy(stack_name, stack_template, env_paths, poll=poll)
2021-03-19 03:54:42.266687 | primary | File "/home/zuul/workspace/ovb/openstack-virtual-baremetal/bin/deploy.py", line 219, in _deploy
2021-03-19 03:54:42.266694 | primary | parameters=parameters)
2021-03-19 03:54:42.266700 | primary | File "/home/zuul/workspace/ovb/.venv/lib/python3.6/site-packages/heatclient/v1/stacks.py", line 171, in create
2021-03-19 03:54:42.266705 | primary | data=kwargs, headers=headers)
2021-03-19 03:54:42.266709 | primary | File "/home/zuul/workspace/ovb/.venv/lib/python3.6/site-packages/keystoneauth1/adapter.py", line 381, in post
2021-03-19 03:54:42.266714 | primary | return self.request(url, 'POST', **kwargs)
2021-03-19 03:54:42.266718 | primary | File "/home/zuul/workspace/ovb/.venv/lib/python3.6/site-packages/heatclient/common/http.py", line 323, in request
2021-03-19 03:54:42.266723 | primary | raise exc.from_response(resp)
2021-03-19 03:54:42.266735 | primary | heatclient.exc.HTTPException: ERROR: b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>openresty/1.15.8.2</center>\r\n</body>\r\n</html>\r\n'
2021-03-19 03:54:52.593535 | primary | ERROR
2021-03-19 03:54:52.593968 | primary | {
2021-03-19 03:54:52.594061 | primary | "delta": "0:01:02.298660",
2021-03-19 03:54:52.594122 | primary | "end": "2021-03-19 03:54:42.354617",
2021-03-19 03:54:52.594179 | primary | "msg": "non-zero return code",
2021-03-19 03:54:52.594233 | primary | "rc": 1,
2021-03-19 03:54:52.594288 | primary | "start": "2021-03-19 03:53:40.055957"
2021-03-19 03:54:52.594342 | primary | }
2021-03-19 03:54:52.634787 |
2021-03-19 03:54:52.635037 | TASK [ovb-manage : Show last stack status]
2021-03-19 03:57:55.125328 | primary | ERROR: b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>openresty/1.15.8.2</center>\r\n</body>\r\n</html>\r\n'
2021-03-19 03:57:55.486039 | primary | ok: Runtime: 0:03:01.867875
2021-03-19 03:57:55.530320 |
2021-03-19 03:57:55.530756 | TASK [ovb-manage : Delete CREATE_FAILED stack]
2021-03-19 03:58:06.095337 | primary | ERROR
2021-03-19 03:58:06.095703 | primary | {
2021-03-19 03:58:06.095773 | primary | "msg": "The conditional check '\"CREATE_FAILED\" in result.stdout' failed. The error was: error while evaluating conditional (\"CREATE_FAILED\" in result.stdout): 'result' is undefined\n\nThe error appears to be in '/var/lib/zuul/builds/cc726bd8dd95450185f6bb0ba48b061f/trusted/project_0/review.rdoproject.org/config/roles/ovb-manage/tasks/ovb-create-stack.yml': line 151, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Delete CREATE_FAILED stack\n ^ here\n"
2021-03-19 03:58:06.095825 | primary | }
2021-03-19 03:58:06.112564 |
```
The jobs are failing at this task: https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/roles/ovb-manage/tasks/ovb-create-stack.yml#L37 (The later "Delete CREATE_FAILED stack" error appears to be a follow-on symptom rather than a separate bug: its conditional references a `result` variable that was never registered once the stack creation itself errored out.)
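
For context: deploy.py creates the stack through python-heatclient, so the proxy's 504 surfaces as heatclient.exc.HTTPException, as in the traceback above. Below is a minimal, hypothetical sketch (not the deploy.py code; the function name and retry policy are made up) of how a caller could tolerate such transient gateway errors:

```python
# Hypothetical retry wrapper around heatclient stack creation; `heat` is an
# already-constructed heatclient client instance.
import time

from heatclient import exc


def create_stack_with_retry(heat, name, template, parameters=None,
                            attempts=3, delay=30):
    for attempt in range(1, attempts + 1):
        try:
            return heat.stacks.create(stack_name=name, template=template,
                                      parameters=parameters or {})
        except exc.HTTPException:
            # A 504 only means the proxy gave up waiting; Heat may still have
            # accepted the request, so a blind retry can create a duplicate
            # stack. Real code should check for an existing stack first.
            if attempt == attempts:
                raise
            time.sleep(delay)
```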

Logging this bug for further investigation.

Below is a list of the other failed jobs:
https://logserver.rdoproject.org/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-ussuri/8de6615/job-output.txt

https://logserver.rdoproject.org/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-victoria/c019564/job-output.txt

https://logserver.rdoproject.org/openstack-component-clients/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-clients-master/031951e/job-output.txt

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/b553297/job-output.txt

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-master/d3fa56c/job-output.txt

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/96e40c9/job-output.txt

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master/387cf53/job-output.txt

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/c77ba99/job-output.txt

https://logserver.rdoproject.org/90/781590/1/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/32f7697/job-output.txt

https://logserver.rdoproject.org/64/781564/6/openstack-check/tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset001/df01930/job-output.txt

https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-baremetal-train/bd57d4f/job-output.txt

https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-baremetal-ussuri/d5e8d25/job-output.txt

https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-baremetal-victoria/a00e368/job-output.txt

daniel.pawlik (daniel-pawlik) wrote:

Created a ticket with Vexxhost.
The "stack_create_complete" metric in our Prometheus system gives an estimate of when the service went down:

https://prometheus.monitoring.softwarefactory-project.io/prometheus/graph?g0.expr=stack_create_complete&g0.tab=0&g0.stacked=1&g0.range_input=2d
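
If someone wants to script the same lookup, here is a hedged sketch against Prometheus's standard query_range HTTP API (the time window and step below are illustrative):

```python
# Pull the stack_create_complete series over a window to see where it drops.
import requests

PROM = "https://prometheus.monitoring.softwarefactory-project.io/prometheus"

resp = requests.get(
    PROM + "/api/v1/query_range",
    params={
        "query": "stack_create_complete",
        "start": "2021-03-17T00:00:00Z",   # illustrative window
        "end": "2021-03-19T00:00:00Z",
        "step": "5m",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3])
```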

wes hayutin (weshayutin) wrote:

After the Heat issue was fixed in Vexxhost and jobs moved past RETRY into the actual deployment, jobs are now failing in introspection across all branches.

https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset001&job_name=tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001&job_name=tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035

wes hayutin (weshayutin) wrote:

2021-03-19 15:06:43.584374 | primary | TASK [overcloud-prep-images : Wait until nodes will be manageable] *************
2021-03-19 15:06:43.584380 | primary | Friday 19 March 2021 15:06:43 +0000 (0:00:00.087) 0:40:02.469 **********
2021-03-19 15:06:48.729196 | primary | FAILED - RETRYING: Wait until nodes will be manageable (10 retries left).
2021-03-19 15:07:23.477153 | primary | FAILED - RETRYING: Wait until nodes will be manageable (9 retries left).
2021-03-19 15:07:57.125936 | primary | FAILED - RETRYING: Wait until nodes will be manageable (8 retries left).
2021-03-19 15:08:30.685535 | primary | FAILED - RETRYING: Wait until nodes will be manageable (7 retries left).
2021-03-19 15:09:04.688520 | primary | FAILED - RETRYING: Wait until nodes will be manageable (6 retries left).
2021-03-19 15:09:39.078475 | primary | FAILED - RETRYING: Wait until nodes will be manageable (5 retries left).
2021-03-19 15:10:12.662236 | primary | FAILED - RETRYING: Wait until nodes will be manageable (4 retries left).
2021-03-19 15:10:46.204220 | primary | FAILED - RETRYING: Wait until nodes will be manageable (3 retries left).
2021-03-19 15:11:19.868487 | primary | FAILED - RETRYING: Wait until nodes will be manageable (2 retries left).
2021-03-19 15:11:53.655143 | primary | FAILED - RETRYING: Wait until nodes will be manageable (1 retries left).
2021-03-19 15:12:27.535707 | primary | fatal: [undercloud]: FAILED! => {
2021-03-19 15:12:27.536324 | primary | "attempts": 10,
2021-03-19 15:12:27.536375 | primary | "changed": false,
2021-03-19 15:12:27.536392 | primary | "cmd": "set -o pipefail && openstack --os-cloud undercloud baremetal node list -f value -c \"Provisioning State\" | grep -v -e manageable -e available",
2021-03-19 15:12:27.536434 | primary | "delta": "0:00:03.375947",
2021-03-19 15:12:27.536456 | primary | "end": "2021-03-19 15:12:27.472662",
2021-03-19 15:12:27.536486 | primary | "failed_when_result": true,
2021-03-19 15:12:27.536515 | primary | "rc": 0,
2021-03-19 15:12:27.536537 | primary | "start": "2021-03-19 15:12:24.096715"
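
For clarity, the task fails while any node reports a provisioning state other than "manageable" or "available": `grep -v` exits 0 whenever such lines exist, which trips `failed_when_result`. A rough Python equivalent of that polling loop (illustrative only, not the tripleo-quickstart code):

```python
# Poll baremetal node provisioning states until every node is either
# "manageable" or "available", mirroring the CI task's grep -v check.
import subprocess
import time


def nodes_not_ready():
    out = subprocess.check_output(
        ["openstack", "--os-cloud", "undercloud", "baremetal", "node",
         "list", "-f", "value", "-c", "Provisioning State"],
        universal_newlines=True)
    return [s for s in out.splitlines()
            if s and s not in ("manageable", "available")]


for _ in range(10):                      # the CI task retries 10 times
    pending = nodes_not_ready()
    if not pending:
        break
    time.sleep(30)
else:
    raise SystemExit("nodes stuck in states: %s" % pending)
```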

wes hayutin (weshayutin) wrote:

2021-03-19 15:55:01.747322 | primary | TASK [overcloud-prep-images : Import and register overcloud nodes - legacy] ****
2021-03-19 15:55:01.747433 | primary | Friday 19 March 2021 15:55:01 +0000 (0:00:20.149) 0:56:56.724 **********
2021-03-19 15:59:31.041992 | primary | fatal: [undercloud]: FAILED! => {
2021-03-19 15:59:31.042059 | primary | "changed": true,
2021-03-19 15:59:31.042073 | primary | "cmd": "set -o pipefail && /home/zuul/overcloud-import-nodes.sh 2>&1 | awk '{ print strftime(\"%Y-%m-%d %H:%M:%S |\"), $0; fflush(); }' > /home/zuul/overcloud_import_nodes.log\n",
2021-03-19 15:59:31.042646 | primary | "delta": "0:04:28.936088",
2021-03-19 15:59:31.042700 | primary | "end": "2021-03-19 15:59:31.012214",
2021-03-19 15:59:31.042713 | primary | "rc": 1,
2021-03-19 15:59:31.042723 | primary | "start": "2021-03-19 15:55:02.076126"
2021-03-19 15:59:31.042733 | primary | }
2021-03-19 15:59:31.042744 | primary |

wes hayutin (weshayutin) wrote:

'ipmitool -I lanplus -H 10.0.1.130 -L ADMINISTRATOR -U admin -R 1 -N 5 -f /tmp/tmpwdusu9uz power status' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2021-03-19 16:19:34.252 7 WARNING ironic.drivers.modules.ipmitool [req-9d458dfd-c161-42ae-8391-4a978864e038 8e8a1da3d7774d2188e7003c69669912 80a1d32d74944d5c938bda39a8943ddd - default default] IPMI Error encountered, retrying "ipmitool -I lanplus -H 10.0.1.130 -L ADMINISTRATOR -U admin -R 1 -N 5 -f /tmp/tmpwdusu9uz power status" for node d5acd169-7f18-427e-8c4d-7d2d59adec5e. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 10.0.1.130 -L ADMINISTRATOR -U admin -R 1 -N 5 -f /tmp/tmpwdusu9uz power status
Exit code: 1
Stdout: ''

https://logserver.rdoproject.org/69/781769/1/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/9c47562/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz
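
For triage, a hypothetical standalone reproduction of the probe ironic-conductor runs, useful for confirming BMC reachability independently of ironic (the host and password file below are illustrative, mirroring the command in the log):

```python
# Run the same ipmitool power-status probe ironic issues against the BMC.
import subprocess

cmd = ["ipmitool", "-I", "lanplus", "-H", "10.0.1.130",
       "-L", "ADMINISTRATOR", "-U", "admin", "-R", "1", "-N", "5",
       "-f", "/tmp/bmc_password", "power", "status"]
probe = subprocess.run(cmd, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
print("rc:", probe.returncode)
print(probe.stdout or probe.stderr)
```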

chandan kumar (chkumar246) wrote:

https://review.rdoproject.org/r/c/rdo-jobs/+/32627 should help us get more info on why the infra is behaving like this.

yatin (yatinkarel) wrote:

The BMC node is unable to reach the metadata service (util.py[WARNING]: No active metadata service found):
https://logserver.rdoproject.org/69/781769/1/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/9c47562/logs/bmc_1_64270-console.log

An issue has been filed with Vexxhost to get the root cause and a fix.
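
A hedged sketch of the check implied by that console log: from inside the BMC guest, probe the standard OpenStack metadata endpoint that cloud-init depends on (the endpoint path is the standard one; the rest is illustrative):

```python
# Verify the metadata service is reachable from the BMC instance.
import requests

try:
    r = requests.get(
        "http://169.254.169.254/openstack/latest/meta_data.json", timeout=5)
    r.raise_for_status()
    print("metadata reachable, instance uuid:", r.json().get("uuid"))
except requests.RequestException as e:
    print("no active metadata service:", e)
```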

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
yatin (yatinkarel) wrote:

Some context on the fix for the metadata issue:
<nhicher_> mnaser: what was the issue?
<mnaser> nhicher_: we've added an ip routing rule that seems to have routed things incorrect but didn't catch it inside monitoring
<nhicher_> mnaser: ok, thanks
