Ocata: Overcloud failed because of "WebSocket timed out" in CI

Bug #1780183 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: High
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

The tripleo-ci-centos-7-nonha-multinode-oooq job is failing in the "Deploy the overcloud" step:
http://logs.openstack.org/39/577339/1/gate/tripleo-ci-centos-7-nonha-multinode-oooq/e97fdad/job-output.txt.gz#_2018-07-05_04_11_28_429219

Below are the logs from overcloud_deploy.log.txt.gz: http://logs.openstack.org/39/577339/1/gate/tripleo-ci-centos-7-nonha-multinode-oooq/e97fdad/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2018-07-05 04:11:32 | ++ export OS_CLOUDNAME
2018-07-05 04:11:32 | + openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates --libvirt-type qemu --timeout 89 -e /home/zuul/cloud-names.yaml -e /home/zuul/hostnamemap.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-bootstrap-environment-centos.yaml --overcloud-ssh-user zuul -e /usr/share/openstack-tripleo-heat-templates/ci/environments/multinode.yaml -e /home/zuul/overcloud_network_params.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /opt/stack/new/tripleo-ci/test-environments/worker-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/debug.yaml --validation-errors-nonfatal --roles-file /home/zuul/overcloud_roles.yaml --compute-scale 0
2018-07-05 04:11:35 | The disable_upgrade_deployment flag is not set in the roles file. This flag is expected when you have a nova-compute or swift-storage role. Please check the contents of the roles file: [{'networks': ['External', 'InternalApi', 'Storage', 'StorageMgmt', 'Tenant'], 'CountDefault': 1, 'name': 'Controller', 'tags': ['primary', 'controller']}]
2018-07-05 04:11:38 | No image with the name 'bm-deploy-kernel' found - make sure you've uploaded boot images
2018-07-05 04:11:38 | No image with the name 'bm-deploy-ramdisk' found - make sure you've uploaded boot images
2018-07-05 04:11:39 | Expected hypervisor stats not met
2018-07-05 04:11:39 | Not enough nodes - available: 0, requested: 1
2018-07-05 04:11:39 | Configuration has 4 errors, fix them before proceeding. Ignoring these errors is likely to lead to a failed deploy.
2018-07-05 04:18:26 | Timed out waiting for messages from Execution (ID: 3c3b9bde-bda3-4e47-b5c3-acc25ee4215f, State: SUCCESS). The Workflow finished successfully but no messages were received before the WebSocket timed out.
2018-07-05 04:18:26 |
2018-07-05 04:18:26 | Removing the current plan files
2018-07-05 04:18:26 | Uploading new plan files
2018-07-05 04:18:26 | Started Mistral Workflow tripleo.plan_management.v1.update_deployment_plan. Execution ID: 3c3b9bde-bda3-4e47-b5c3-acc25ee4215f
2018-07-05 04:18:27 | + status_code=1
2018-07-05 04:18:27 | + openstack stack list
2018-07-05 04:18:27 | + grep -q overcloud
2018-07-05 04:18:31 | + echo 'overcloud deployment not started. Check the deploy configurations'
2018-07-05 04:18:31 | overcloud deployment not started. Check the deploy configurations
2018-07-05 04:18:31 | + exit 1
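
When scanning many CI runs for the same symptom, it can help to pull the timeout message out of the deploy log automatically. The following is only a minimal sketch: the regular expression is derived from the single message shown above, and the function name `find_websocket_timeouts` is hypothetical, not part of any TripleO tooling.

```python
import re

# Pattern built from the timeout message in the log above (assumption:
# other occurrences of the message use the same wording).
TIMEOUT_RE = re.compile(
    r"Timed out waiting for messages from Execution "
    r"\(ID: (?P<id>[0-9a-f-]+), State: (?P<state>\w+)\)"
)


def find_websocket_timeouts(lines):
    """Return (execution_id, state) pairs for each timeout message found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group("id"), match.group("state")))
    return hits


# Two lines copied from the log above.
sample = [
    "2018-07-05 04:11:39 | Not enough nodes - available: 0, requested: 1",
    "2018-07-05 04:18:26 | Timed out waiting for messages from Execution "
    "(ID: 3c3b9bde-bda3-4e47-b5c3-acc25ee4215f, State: SUCCESS). "
    "The Workflow finished successfully but no messages were received "
    "before the WebSocket timed out.",
]

for execution_id, state in find_websocket_timeouts(sample):
    print(execution_id, state)
```

A state of SUCCESS together with this message suggests the workflow itself completed and only the notification back to the client was lost.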

The same issue is seen in other jobs as well: http://logs.openstack.org/91/564291/15/check/tripleo-ci-centos-7-scenario003-multinode-oooq/f44bc41/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-04_09_39_36

Something fishy might be happening in Mistral and Zaqar.
Please have a look.

Changed in tripleo:
milestone: none → rocky-3
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
status: New → Triaged
chandan kumar (chkumar246) wrote :
Changed in tripleo:
importance: Undecided → High
summary: - Overcloud failed to Deploy in tripleo-ci-centos-7-nonha-multinode-oooq
+ Overcloud failed because of "WebSocket timed out" in tripleo-ci-centos-7-nonha-multinode-oooq
description: updated
summary: - Overcloud failed because of "WebSocket timed out" in tripleo-ci-centos-7-nonha-multinode-oooq
+ Ocata: Overcloud failed because of "WebSocket timed out" in tripleo-ci-centos-7-nonha-multinode-oooq
summary: - Ocata: Overcloud failed because of "WebSocket timed out" in tripleo-ci-centos-7-nonha-multinode-oooq
+ Ocata: Overcloud failed because of "WebSocket timed out" in CI
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Zaqar errors:
http://logs.openstack.org/91/564291/15/check/tripleo-ci-centos-7-scenario003-multinode-oooq/f44bc41/logs/undercloud/var/log/zaqar/zaqar.log.txt.gz

2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook [-] webhook task got exception: HTTPConnectionPool(host='centos-7-vexxhost-ca-ymq-1-0000540723', port=39458): Max retries exceeded with url: /7ccefb06-ff53-4e19-a69b-acf83fbcf3f6 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc11fbbf750>: Failed to establish a new connection: [Errno 111] Connection refused',)).
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook Traceback (most recent call last):
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/zaqar/notification/tasks/webhook.py", line 44, in execute
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook headers=headers)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/requests/api.py", line 110, in post
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook return request('post', url, data=data, json=json, **kwargs)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/requests/api.py", line 56, in request
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook return session.request(method=method, url=url, **kwargs)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook resp = self.send(prep, **send_kwargs)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook r = adapter.send(request, **kwargs)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 487, in send
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook raise ConnectionError(e, request=request)
2018-07-04 09:40:40.883 10268 ERROR zaqar.notification.tasks.webhook ConnectionError: HTTPConnectionPool(host='centos-7-vexxhost-ca-ymq-1-0000540723', port=39458): Max retries exceeded with url: /7ccefb06-ff53-4e19-a69b-acf83fbcf3f6 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc11fbbf750>: Failed to establish a new connection: [Errno 111] Connection refused',))
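
The traceback ends in "[Errno 111] Connection refused", meaning the TCP connection to the webhook host and port was rejected before any HTTP exchange took place. A quick way to check this by hand on the undercloud is a plain TCP connect test. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and the host and port would be taken from the error message above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Host and port as they appear in the Zaqar error above.
    print(can_connect("centos-7-vexxhost-ca-ymq-1-0000540723", 39458))
```

The check only reports whether the handshake succeeds. It cannot tell apart a closed port, a firewall REJECT rule, or a hostname that resolves to a different address than expected.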

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Possible firewall issue: zaqar has only one rule:

http://logs.openstack.org/91/564291/15/check/tripleo-ci-centos-7-scenario003-multinode-oooq/f44bc41/logs/undercloud/var/log/extra/network.txt.gz
-A INPUT -p tcp -m multiport --dports 9000 -m comment --comment "134 zaqar websockets ipv4" -m state --state NEW -j ACCEPT

but it tries to connect to port 39458, which is listening:
http://logs.openstack.org/91/564291/15/check/tripleo-ci-centos-7-scenario003-multinode-oooq/f44bc41/logs/undercloud/var/log/extra/netstat.txt.gz
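
To make the port mismatch concrete, the rule quoted above can be checked against both ports. This is a rough sketch under strong assumptions: it inspects one rule in isolation and ignores chain order, default policy, interface matches, and rules for established connections, so it is not a model of how iptables evaluates a packet. The helper names `option_value` and `rule_accepts_port` are hypothetical.

```python
import shlex


def option_value(tokens, *names):
    """Return the token following the first occurrence of any given option."""
    for i, tok in enumerate(tokens[:-1]):
        if tok in names:
            return tokens[i + 1]
    return None


def rule_accepts_port(rule, port):
    """Return True if a single ACCEPT rule names the given destination port."""
    # shlex keeps the quoted --comment text together as one token.
    tokens = shlex.split(rule)
    if option_value(tokens, "-j", "--jump") != "ACCEPT":
        return False
    spec = option_value(tokens, "--dports", "--dport")
    if spec is None:
        return False  # sketch: rules without an explicit port are ignored
    for part in spec.split(","):
        # multiport entries are single ports or low:high ranges
        low, _, high = part.partition(":")
        if int(low) <= port <= int(high or low):
            return True
    return False


# The rule exactly as it appears in network.txt.gz above.
zaqar_rule = (
    '-A INPUT -p tcp -m multiport --dports 9000 -m comment '
    '--comment "134 zaqar websockets ipv4" -m state --state NEW -j ACCEPT'
)

print(rule_accepts_port(zaqar_rule, 9000))
print(rule_accepts_port(zaqar_rule, 39458))
```

The rule names only port 9000, so nothing in it explicitly allows new inbound connections to port 39458.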

Changed in tripleo:
status: Triaged → Invalid