Pacemaker performance causes intermittent galera issues in loaded CI env

Bug #1987092 reported by Jiri Podivin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Since 2022-08-18 the periodic-tripleo-ci-centos-9-standalone-validation-master job has consistently failed during the 'Check Keystone public endpoint status' step, after all 30 attempts are exhausted.

The keystone container logs don't contain errors. However, around the same time, "error: kex_exchange_identification: Connection closed by remote host" appears in the journal on the undercloud.
An analogous issue can be observed in periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation.

Trace:
------
2022-08-18 19:57:01.330006 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 6 retries left
2022-08-18 19:57:08.874973 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 5 retries left
2022-08-18 19:57:14.046036 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 4 retries left
2022-08-18 19:57:19.201638 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 3 retries left
2022-08-18 19:57:24.350340 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 2 retries left
2022-08-18 19:57:29.508238 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 1 retries left
2022-08-18 19:57:34.666224 | fa163e3e-f896-e3f5-81db-0000000034d7 | FATAL | Check Keystone public endpoint status | undercloud | item=neutron | error={"ansible_job_id": "345769691811.194096", "ansible_loop_var": "tripleo_keystone_resources_endpoint_async_result_item", "attempts": 30, "changed": false, "finished": 0, "results_file": "/root/.ansible_async/345769691811.194096", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": [], "tripleo_keystone_resources_endpoint_async_result_item": {"ansible_job_id": "345769691811.194096", "ansible_loop_var": "tripleo_keystone_resources_data", "changed": true, "failed": 0, "finished": 0, "results_file": "/root/.ansible_async/345769691811.194096", "started": 1, "tripleo_keystone_resources_data": {"key": "neutron", "value": {"endpoints": {"admin": "http://192.168.24.3:9696", "internal": "http://192.168.24.3:9696", "public": "http://192.168.24.3:9696"}, "region": "regionOne", "service": "network", "users": {"neutron": {"password": "zZWQAqqm4VQlQdSUmidoLxQvO", "roles": ["admin", "service"]}}}}}}

Logs:
-----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-validation-master/ed5ad21/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/var/log/extra/journal_errors.txt.gz
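
For context, the failing step is essentially a readiness probe: it polls the Keystone public endpoint and gives up once all 30 attempts are exhausted. A rough shell equivalent of that check (illustrative only, not the actual tripleo-ansible task; the port and retry cadence are assumptions):

~~~
# Illustrative sketch of the retried endpoint check, not the real task.
# 192.168.24.3 comes from the catalog entries in the trace; port 5000 is assumed.
for attempt in $(seq 1 30); do
    if curl -sf -o /dev/null http://192.168.24.3:5000/v3/; then
        echo "Keystone public endpoint is up (attempt ${attempt})"
        break
    fi
    sleep 10
done
~~~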

Revision history for this message
Jiri Podivin (jpodivin) wrote :

The neutron container logs of the multinode job contain multiple errors raised by keystone.

Trace:
------
 keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at http://192.168.24.3:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.

Log:
----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/var/log/extra/errors.txt.gz

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → zed-1
Revision history for this message
Jiri Podivin (jpodivin) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote (last edit ):

periodic-tripleo-ci-centos-9-standalone-validation-master has been in the RETRY state consistently for three runs [1].

IMHO, before closing this out we need to confirm whether the job is stable enough.

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-standalone-validation-master&skip=0

Rekicked here: https://review.rdoproject.org/r/c/testproject/+/44593

Revision history for this message
Jiri Podivin (jpodivin) wrote :

The most recent execution of the job in the test project failed as well, this time with a 'Gateway Timeout' during the 'Clean up legacy Cinder keystone catalog entries' task, but still during the overcloud deploy and still with an error raised from keystone code.

Since 'Clean up legacy Cinder keystone catalog entries' precedes 'Check Keystone public endpoint status', which was the point of the original failure, it's not possible to determine whether the original issue was resolved or merely obscured by the new one.

However, given that the two tasks follow each other in quick succession, and that the error is in both cases raised from within keystone code, it is not implausible that the issues are connected.

Trace:
------
2022-08-22 07:07:02 | 2022-08-22 07:07:02.718837 | fa163e0a-00e1-a192-ef98-0000000092a1 | FATAL | Clean up legacy Cinder keystone catalog entries | undercloud | item={'service_name': 'cinderv2', 'service_type': 'volumev2'} | error={"ansible_index_var": "cinder_api_service", "ansible_loop_var": "item", "changed": false, "cinder_api_service": 0, "item": {"service_name": "cinderv2", "service_type": "volumev2"}, "module_stderr": "Traceback (most recent call last):\n File \"<stdin>\", line 107, in <module>\n File \"<stdin>\", line 99, in _ansiballz_main\n File \"<stdin>\", line 47, in invoke_module\n File \"/usr/lib64/python3.9/runpy.py\", line 225, in run_module\n return _run_module_code(code, init_globals, run_name, mod_spec)\n File \"/usr/lib64/python3.9/runpy.py\", line 97, in _run_module_code\n _run_code(code, mod_globals, init_globals,\n File \"/usr/lib64/python3.9/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 185, in <module>\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 181, in main\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/module_utils/openstack.py\", line 407, in __call__\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 140, in run\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 517, in search_services\n services = self.list_services()\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 492, in list_services\n if self._is_client_version('identity', 2):\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/openstackcloud.py\", line 460, in _is_client_version\n client = getattr(self, client_name)\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 31, in _identity_client\n self._raw_clients['identity'] = self._get_versioned_client(\n File \"/usr/lib/python3.9/sit...


Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit ):

The failure mentioned in comment 5 looks related to galera, which failed to start:

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-1/var/log/containers/haproxy/haproxy.log.txt.gz
~~~
Aug 22 07:07:02 overcloud-controller-1 haproxy[7]: 10.0.0.1:50862 [22/Aug/2022:07:05:32.534] keystone_public~ keystone_public_be/overcloud-controller-1.internalapi.localdomain 0/0/0/81746/81746 504 397 - - ---- 1/1/0/0/0 0/0 "POST /v3/auth/tokens HTTP/1.1"
...

Aug 22 07:08:33 overcloud-controller-1 haproxy[7]: 10.0.0.1:48868 [22/Aug/2022:07:07:03.548] keystone_public~ keystone_public_be/overcloud-controller-1.internalapi.localdomain 0/0/0/82856/82856 504 397 - - ---- 1/1/0/0/0 0/0 "POST /v3/auth/tokens HTTP/1.1"
~~~

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-1/var/log/containers/keystone/keystone.log.txt.gz
~~~
2022-08-22 07:05:32.556 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -1 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2022-08-22 07:05:42.573 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -2 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2022-08-22 07:05:55.739 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -3 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
...
~~~

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
~~~
Failed Resource Actions:
  * galera 10s-interval monitor on galera-bundle-2 returned 'error' (local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:50 2022
  * galera 10s-interval monitor on galera-bundle-1 returned 'error' (local node <overcloud-controller-0> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:49 2022
  * galera promote on galera-bundle-0 returned 'error' (MySQL server failed to start (pid=2516) (rc=0), please check your installation) at Mon Aug 22 06:41:48 2022 after 33.062s
~~~
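
For anyone retracing this triage on a live controller, the galera state can be confirmed directly. The commands below are generic pcs/mariadb ones; the bundle container name is an assumption and should be taken from the actual 'podman ps' output:

~~~
# Pacemaker's view of the galera bundle and any failed resource actions
pcs status --full

# Galera's own view of cluster membership and primary-component status,
# run inside the bundle container (container name is an assumption)
podman exec galera-bundle-podman-0 \
    mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"
~~~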

Revision history for this message
Ronelle Landy (rlandy) wrote :

IIUC, this is still an OVB issue, but no longer a standalone problem?

Revision history for this message
Jiri Podivin (jpodivin) wrote :

Indeed. That has seemingly recovered.

Revision history for this message
Jiri Podivin (jpodivin) wrote :
Revision history for this message
Jiri Podivin (jpodivin) wrote :

Comparing the galera bundle logs from the failing job [0] with those from a successful run of the sister job [1] shows divergence fairly early, starting with the error:

2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]

followed by others.

Trace:
------
2022-08-22T06:40:59.070682554+00:00 stderr F (log_execute) info: executing - rsc:galera action:start call_id:12
2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]
2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:01.472239128+00:00 stderr F (log_finished) info: galera start (call 12, PID 56) exited with status 0 (execution time 2.401s)
2022-08-22T06:41:02.702300941+00:00 stderr F (log_op_output) info: galera_monitor_30000[621] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:06.535741387+00:00 stderr F (log_op_output) info: galera_monitor_20000[755] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:13.340246042+00:00 stderr F (cancel_recurring_action) info: Cancelling ocf operation galera_monitor_20000
2022-08-22T06:41:13.381264500+00:00 stderr F (cancel_recurring_action) info: Cancelling ocf operation galera_monitor_30000
2022-08-22T06:41:13.387622857+00:00 stderr F (log_execute) info: executing - rsc:galera action:promote call_id:176
2022-08-22T06:41:23.214099415+00:00 stderr F (log_op_output) info: galera_promote_0[898] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:23.214099415+00:00 stderr F (log_op_output) info: galera_promote_0[898] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:23.214099415+00:00 stderr F (log_finished) info: galera promote (call 176, PID 898) exited with status 0 (execution time 9.827s)
2022-08-22T06:41:49.722447502+00:00 stderr F (log_op_output) info: galera_monitor_10000[1718] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:41:59.811386227+00:00 stderr F (log_op_output) info: galera_monitor_10000[1779] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:42:12.085343779+00:00 stderr F (log_op_output) info: galera_monitor_10000[1839] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:42:22.216097374+00:00 stderr F (log_op_output) info: galera_monitor_10000[1899] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in...

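The missing /var/lib/mysql/grastate.dat is notable: that file is where Galera records a node's last known cluster state, and its absence typically means the node has no saved state to recover from (for example a fresh or wiped datadir). For reference, on a healthy node the file looks roughly like this (UUID and seqno are placeholders):

~~~
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
~~~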

Revision history for this message
Luca Miccini (lmiccini2) wrote :

 * Container bundle set: galera-bundle [cluster.common.tag/mariadb:pcmklatest]:
    * galera-bundle-0 (ocf:heartbeat:galera): FAILED Promoted overcloud-controller-2 (blocked)
    * galera-bundle-1 (ocf:heartbeat:galera): Unpromoted overcloud-controller-0
    * galera-bundle-2 (ocf:heartbeat:galera): Unpromoted overcloud-controller-1

Failed Resource Actions:
  * galera 10s-interval monitor on galera-bundle-2 returned 'error' (local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:50 2022
  * galera 10s-interval monitor on galera-bundle-1 returned 'error' (local node <overcloud-controller-0> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:49 2022
  * galera promote on galera-bundle-0 returned 'error' (MySQL server failed to start (pid=2516) (rc=0), please check your installation) at Mon Aug 22 06:41:48 2022 after 33.062s

If we look at the controllers' journals, it looks like they are isolated from a network perspective for a few seconds (maybe it's a real network hiccup, or maybe just an oversubscribed underlying hypervisor):

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-0/var/log/extra/journal.txt.gz

Aug 22 06:41:30 overcloud-controller-0 pacemaker-attrd[21566]: notice: Setting galera-no-grastate[overcloud-controller-1]: true -> (unset)
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-1.internalapi.localdomain is DOWN, reason: Layer4 timeout, check duration: 1001ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-2.internalapi.localdomain is DOWN, reason: Layer4 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: backend mysql_be has no server available!
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [MAIN ] Corosync main process was not scheduled (@1661150499475) for 9392.5205 ms (threshold is 8520.0000 ms). Consider token timeout increase.
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] link: host: 3 link: 0 is down
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] link: host: 2 link: 0 is down
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 3 has no active links
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 2 has no active links

Aug 22 06:41:40 overcloud-controller-0 corosync[21550]: [TOTEM ] Token has not been received in 7987 ms
Aug 22 06:41:40 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-1.internalapi.loca...
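
The 'Consider token timeout increase' hint refers to corosync's totem token timeout: the scheduling warning fires when the corosync process is not scheduled for a large fraction of that timeout, and the subsequent 'Token has not been received' and link-down messages are what knock galera out of primary mode. On a deployed controller the setting lives in /etc/corosync/corosync.conf, roughly as below (a sketch only; the value matches the fix discussed later in this bug):

~~~
totem {
    version: 2
    # token timeout in milliseconds; raising it makes the cluster more
    # tolerant of scheduling stalls on an oversubscribed hypervisor
    token: 30000
}
~~~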

Revision history for this message
Jiri Podivin (jpodivin) wrote :

I can confirm that this bug is also hitting the wallaby version of the job:
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation

Logs:
-----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation/309130d/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

periodic-tripleo-ci-centos-9-standalone-validation-master is green now [1].

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-standalone-validation-master&skip=0

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation is also green [2].

[2] https://review.rdoproject.org/zuul/build/125a8b5120b94f44aa092045bce575d1

Revision history for this message
Jiri Podivin (jpodivin) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation is still failing. The master job has passed once, so maybe it will get better too.

Revision history for this message
Jiri Podivin (jpodivin) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation is now passing [3]. There is also a second execution of the master job [4].

[3] https://review.rdoproject.org/zuul/build/628974f5963e43729a3e94b45848c674
[4] https://review.rdoproject.org/zuul/build/1afd570571be4bc5b94d87d62106f822

Revision history for this message
Jiri Podivin (jpodivin) wrote :

There was one successful run of periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation on Saturday.

But since then there have been several failures [1]: two due to the 'extras-common' repo being unavailable, and one during the overcloud deploy at 'Discovering nova hosts' [2], with 'Lost connection to MySQL server during query' as the error.

The job is still unstable, and the controller-0 logs [3] contain galera errors, so it could be the same issue.

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation&project=openstack/tripleo-ci
[2] https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/acda03f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/acda03f/logs/overcloud-controller-0/var/log/extra/journal_errors.txt.gz

Revision history for this message
Jakob Meng (jm1337) wrote :

The same galera issue can be observed in our periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby job as well.

overcloud-controller-2/var/log/containers/stdouts/galera-bundle.log.txt.gz [1]:

  2022-08-30T02:23:26.011709616+00:00 stderr F (log_op_output) info: galera_monitor_10000[7104] error output [ ocf-exit-reason:local node <overcloud-controller-2> is started, but not in primary mode. Unknown state. ]

overcloud-controller-2/var/log/containers/mysql/mysqld.log.txt.gz [2]:

  2022-08-30 2:23:31 0 [Note] /usr/libexec/mariadbd (initiated by: unknown): Normal shutdown

In this case, the job failed in [3]:

  TASK [os_tempest : Executing python-tempestconf]

with a completely different error [4]:

  2022-08-30 02:26:02.485 364416 INFO tempest.lib.common.rest_client [req-ade2427a-1652-4a3c-a44a-a8f661753257 ] Request (main): 500 GET https://10.0.0.5:13000/v3/projects 0.028s
  2022-08-30 02:26:02.485 364416 CRITICAL tempest [-] Unhandled error: tempest.lib.exceptions.ServerFault: Got server fault

while Tempest tries to access Keystone, which fails because the DB connection is closed [5]:

  2022-08-30 02:25:06.493 180 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

[1] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/stdouts/galera-bundle.log.txt.gz

[2] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/mysql/mysqld.log.txt.gz

[3] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/job-output.txt

[4] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/undercloud/var/log/tempest/tempestconf.log.txt.gz

[5] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz
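
The correlation here is purely by timestamp: the galera monitor error at 02:23:26, the mariadb shutdown at 02:23:31, and keystone's DB connection errors shortly after. A quick way to line these events up from the downloaded logs (file names as in the links above, after gunzip; grep patterns are illustrative):

~~~
grep 'not in primary mode' galera-bundle.log.txt
grep -E 'Normal shutdown|WSREP' mysqld.log.txt
grep 'DBConnectionError' keystone.log.txt
~~~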

Jiri Podivin (jpodivin)
summary: - periodic-tripleo-ci-centos-9-standalone-validation-master fails on
- 'Check Keystone public endpoint status'
+ Pacemaker performance causes intermittent galera issues
summary: - Pacemaker performance causes intermittent galera issues
+ Pacemaker performance causes intermittent galera issues in loaded CI env
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/858945

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby looks green now.

Revision history for this message
Jiri Podivin (jpodivin) wrote :

We had four green executions in a row, so I would call it pretty stable.

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :
Revision history for this message
Jakob Meng (jm1337) wrote :

Luca's patches [1] and [2] both work properly, but they have not been merged yet due to unrelated job failures. Ronelle's testproject [3] shows that both settings are taking effect [4],[5]. We do not yet know, though, whether this really solves our intermittent issues. We are rekicking another testproject [6] to see whether we can reproduce the galera error even with Luca's patches applied. Let's cross fingers that those galera issues are finally solved by Luca ☺️

[1] https://review.opendev.org/c/openstack/puppet-tripleo/+/859553
[2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859568
[3] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712
[4] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712/comments/a41ba2b7_ed43e251
[5] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712/comments/2be73f3a_6ad0cc15
[6] https://review.rdoproject.org/r/c/testproject/+/45225

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/eb57ba758b69ca95650ea18d5846f28397411936
Submitter: "Zuul (22348)"
Branch: master

commit eb57ba758b69ca95650ea18d5846f28397411936
Author: Ronelle Landy <email address hidden>
Date: Wed Sep 28 15:25:14 2022 -0400

    Sets higher values for timeouts

     - corosync token_timeout: 30000
     - puppet pacemaker: evs.suspect_timeout=PT30S

    These higher values are an attempt to avoid
    slower operations hitting errors on overcloud
    deployments in OVB jobs.

    Related-Bug: #1987092
    Change-Id: Idd8170e435335566e9d1114675259d4d8d603364
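
For reference, evs.suspect_timeout is a Galera provider option: it controls how long a silent cluster member is tolerated before it is suspected dead, so raising it to 30 seconds keeps galera from evicting nodes during short scheduling stalls. On a deployed controller it ends up in the galera/mariadb configuration along these lines (a sketch only; file path and surrounding options omitted, the templating is handled by puppet-tripleo):

~~~
wsrep_provider_options = "evs.suspect_timeout=PT30S"
~~~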

Rabi Mishra (rabi)
tags: added: ovb
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861846
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/b58b1d6e9c4c920052d96fedc261cac8155f1c04
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit b58b1d6e9c4c920052d96fedc261cac8155f1c04
Author: Ronelle Landy <email address hidden>
Date: Wed Sep 28 15:25:14 2022 -0400

    Sets higher values for timeouts

     - corosync token_timeout: 30000
     - puppet pacemaker: evs.suspect_timeout=PT30S

    These higher values are an attempt to avoid
    slower operations hitting errors on overcloud
    deployments in OVB jobs.

    Related-Bug: #1987092
    Change-Id: Idd8170e435335566e9d1114675259d4d8d603364
    (cherry picked from commit eb57ba758b69ca95650ea18d5846f28397411936)

tags: added: in-stable-wallaby
Ronelle Landy (rlandy)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Ghanshyam <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/858945
Reason: TripleO project is retiring now, for details, please see https://review.opendev.org/c/openstack/governance/+/905145 or reach out to OpenStack TC.
