Deployment randomly fails on keystone-related tasks

Bug #1980918 reported by Sandeep Yadav
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Hello,

I noticed some random CI failures this week while on ruck/rover duty. The failures are inconsistent and infrequent, and the affected jobs pass on rerun.

I suspect these failures are caused by poor performance on the infra side, but I am opening this Launchpad bug to see whether we can do anything on the deployment side to handle such issues better.
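One generic way to handle transient server errors on the client side is to retry with exponential backoff. The following is a minimal sketch of that idea only, not the TripleO or Ansible implementation: `TransientServerError` and `retry_on_5xx` are hypothetical names introduced for illustration, and the attempt count and delays are arbitrary assumptions.

```python
import time


class TransientServerError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"server returned {status}")
        self.status = status


def retry_on_5xx(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or the attempts are used up.

    Only 5xx errors are retried; the delay doubles after each failure.
    Any other error, or the last failed attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransientServerError as exc:
            is_last = attempt == attempts - 1
            if exc.status < 500 or is_last:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting. Whether a retry like this would actually help here depends on how long the backend stays unavailable.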

A few examples:

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby/e608750/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

~~~
2022-07-07 00:00:00 | 2022-07-07 00:00:00.497997 | fa163eb9-a625-840d-0f4c-000000007a39 | FATAL | Check Keystone public endpoint status | undercloud | item=swift | error={"ansible_job_id": "183174399702.232944", "ansible_loop_var": "tripleo_keystone_resources_endpoint_async_result_item", "attempts": 2, "changed": false, "extra_data": {"data": null, "details": "misconfiguration and was unable to complete: and the actions you performed just before this error.: Internal Server Error: 500 Internal Server Error: Please contact the server administrator at: in the server error log.: The server encountered an internal error or: your request.: [no address given] to inform them of the time this error occurred,: More information about this error may be available", "response": "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n [no address given] to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n</body></html>\n"}, "finished": 1, "msg": "Failed to create endpoint for service swift: Server Error for url: https://10.0.0.5:13000/v3/endpoints, misconfiguration and was unable to complete: and the actions you performed just before this error.: Internal Server Error: 500 Internal Server Error: Please contact the server administrator at: in the server error log.: The server encountered an internal error or: your request.: [no address given] to inform them of the time this error occurred,: More information about this error may be available", "results_file": "/root/.ansible_async/183174399702.232944", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": [], 
"tripleo_keystone_resources_endpoint_async_result_item": {"ansible_job_id": "183174399702.232944", "ansible_loop_var": "tripleo_keystone_resources_data", "changed": true, "failed": 0, "finished": 0, "results_file": "/root/.ansible_async/183174399702.232944", "started": 1, "tripleo_keystone_resources_data": {"key": "swift", "value": {"endpoints": {"admin": "http://172.18.0.177:8080", "internal": "http:
~~~

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-wallaby/ad6c0a3/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
~~~
2022-07-06 23:51:59 | 2022-07-06 23:51:59.546525 | fa163e7d-6bcc-35bb-de36-000000005fde | FATAL | Check Keystone role status | undercloud | item=service | error={"ansible_job_id": "885033898683.218288", "ansible_loop_var": "tripleo_keystone_resources_role_async_result_item", "attempts": 30, "changed": false, "finished": 0, "results_file": "/root/.ansible_async/885033898683.218288", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": [], "tripleo_keystone_resources_role_async_result_item": {"ansible_job_id": "885033898683.218288", "ansible_loop_var": "tripleo_keystone_resources_role", "changed": true, "failed": 0, "finished": 0, "results_file": "/root/.ansible_async/885033898683.218288", "started": 1, "tripleo_keystone_resources_role": "service"}}
~~~

Tags: ci
Rabi Mishra (rabi) wrote :

Sounds like a network issue, with mysql unavailable.

haproxy log:

Jul 6 23:59:59 overcloud-controller-1 haproxy[7]: 10.0.0.1:58400 [06/Jul/2022:23:59:59.805] keystone_public~ keystone_public/overcloud-controller-0.internalapi.localdomain 0/0/0/28/28 200 985 - - ---- 1/1/0/0/0 0/0 "GET /v3/endpoints HTTP/1.1"
Jul 7 00:00:00 overcloud-controller-1 haproxy[7]: 10.0.0.1:58400 [06/Jul/2022:23:59:59.841] keystone_public~ keystone_public/overcloud-controller-1.internalapi.localdomain 0/0/0/308/308 500 687 - - ---- 1/1/0/0/0 0/0 "POST /v3/endpoints HTTP/1.1"
Jul 7 00:03:01 overcloud-controller-1 haproxy[7]: Backup Server mysql/overcloud-controller-2.internalapi.localdomain is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 12ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 7 00:03:13 overcloud-controller-1 haproxy[7]: Backup Server mysql/overcloud-controller-2.internalapi.localdomain is UP, reason: Layer7 check passed, code: 200, check duration: 14ms. 0 active and 3 backup servers online. Running on backup. 0 sessions requeued, 0 total in queue.
Jul 7 00:03:46 overcloud-controller-1 haproxy[7]: Backup Server mysql/overcloud-controller-0.internalapi.localdomain is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 10ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 7 00:03:47 overcloud-controller-1 haproxy[7]: Backup Server mysql/overcloud-controller-2.internalapi.localdomain is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 17ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 7 00:03:47 overcloud-controller-1 haproxy[7]: Backup Server mysql/overcloud-controller-1.internalapi.localdomain is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 11ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 7 00:03:47 overcloud-controller-1 haproxy[7]: proxy mysql has no server available!
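To follow the health-check transitions in a longer haproxy log, the UP/DOWN events can be pulled out with a small script. This is a minimal sketch that assumes the message format seen in the lines above; `backend_states` is a hypothetical helper, not part of haproxy or TripleO.

```python
import re

# Matches e.g. "Server mysql/overcloud-controller-2.internalapi.localdomain is DOWN"
EVENT = re.compile(
    r"Server (?P<backend>[^/\s]+)/(?P<server>\S+) is (?P<state>UP|DOWN)"
)


def backend_states(lines, backend="mysql"):
    """Return the last reported state of each server in the given backend."""
    states = {}
    for line in lines:
        match = EVENT.search(line)
        if match and match.group("backend") == backend:
            states[match.group("server")] = match.group("state")
    return states
```

Fed the health-check lines above in order, this ends with all three controllers marked DOWN, which matches the final "proxy mysql has no server available!" message.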

mariadb log:

2022-07-06 23:59:13 0 [Note] WSREP: (efe49719-b3f0, 'tcp://172.17.0.68:4567') connection to peer e6d87958-a9ec with addr tcp://172.17.0.69:4567 timed out, no messages seen in PT3S, socket stats: rtt: 5364 rttvar: 9533 rto: 1648000 lost: 1 last_data_recv: 3387 cwnd: 1 last_queued_since: 499910544 last_delivered_since: 3387023112 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2022-07-06 23:59:13 0 [Note] WSREP: (efe49719-b3f0, 'tcp://172.17.0.68:4567') turning message relay requesting on, nonlive peers: tcp://172.17.0.69:4567
2022-07-06 23:59:13 0 [Note] WSREP: (efe49719-b3f0, 'tcp://172.17.0.68:4567') connection to peer e39ccd47-8f20 with addr tcp://172.17.0.78:4567 timed out, no messages seen in PT3S, socket stats: rtt: 2058 rttvar: 3323 rto: 203000 lost: 0 last_data_recv: 3445 cwnd: 10 last_queued_since: 443915138 last_delivered_since: 3444051673 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0
2022-07-06 23:59:14...

