Pacemaker performance causes intermittent galera issues in loaded CI env

Bug #1987092 reported by Jiri Podivin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Since 2022-08-18 the periodic-tripleo-ci-centos-9-standalone-validation-master job has consistently failed during the 'Check Keystone public endpoint status' step, after all 30 attempts are exhausted.

The keystone container logs don't contain errors. However, around the same time, "error: kex_exchange_identification: Connection closed by remote host" appears in the journal on the undercloud.
An analogous issue can be observed in periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation.

Trace:
------
2022-08-18 19:57:01.330006 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 6 retries left
2022-08-18 19:57:08.874973 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 5 retries left
2022-08-18 19:57:14.046036 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 4 retries left
2022-08-18 19:57:19.201638 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 3 retries left
2022-08-18 19:57:24.350340 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 2 retries left
2022-08-18 19:57:29.508238 | fa163e3e-f896-e3f5-81db-0000000034d7 | WAITING | Check Keystone public endpoint status | undercloud | 1 retries left
2022-08-18 19:57:34.666224 | fa163e3e-f896-e3f5-81db-0000000034d7 | FATAL | Check Keystone public endpoint status | undercloud | item=neutron | error={"ansible_job_id": "345769691811.194096", "ansible_loop_var": "tripleo_keystone_resources_endpoint_async_result_item", "attempts": 30, "changed": false, "finished": 0, "results_file": "/root/.ansible_async/345769691811.194096", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": [], "tripleo_keystone_resources_endpoint_async_result_item": {"ansible_job_id": "345769691811.194096", "ansible_loop_var": "tripleo_keystone_resources_data", "changed": true, "failed": 0, "finished": 0, "results_file": "/root/.ansible_async/345769691811.194096", "started": 1, "tripleo_keystone_resources_data": {"key": "neutron", "value": {"endpoints": {"admin": "http://192.168.24.3:9696", "internal": "http://192.168.24.3:9696", "public": "http://192.168.24.3:9696"}, "region": "regionOne", "service": "network", "users": {"neutron": {"password": "zZWQAqqm4VQlQdSUmidoLxQvO", "roles": ["admin", "service"]}}}}}}

Logs:
-----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-standalone-validation-master/ed5ad21/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/var/log/extra/journal_errors.txt.gz
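
For context, the failing step is essentially a readiness probe: it polls the Keystone public endpoint and gives up once all 30 attempts are exhausted. A rough shell equivalent of that check (illustrative only, not the actual tripleo-ansible task; the port and retry cadence are assumptions):

~~~
# Illustrative sketch of the retried endpoint check, not the real task.
# 192.168.24.3 comes from the catalog entries in the trace; port 5000 is assumed.
for attempt in $(seq 1 30); do
    if curl -sf -o /dev/null http://192.168.24.3:5000/v3/; then
        echo "Keystone public endpoint is up (attempt ${attempt})"
        break
    fi
    sleep 10
done
~~~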

Revision history for this message
Jiri Podivin (jpodivin) wrote :

The neutron container logs of the multinode job contain multiple errors raised by keystone.

Trace:
------
 keystoneauth1.exceptions.discovery.DiscoveryFailure: Unable to find a version discovery document at http://192.168.24.3:6385, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.

Log:
----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/25a8020/logs/undercloud/var/log/extra/errors.txt.gz

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → zed-1
Revision history for this message
Jiri Podivin (jpodivin) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote (last edit ):

periodic-tripleo-ci-centos-9-standalone-validation-master has been in the RETRY state consistently for three runs [1].

IMHO, before closing this out we need to confirm whether the job is stable enough.

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-standalone-validation-master&skip=0

Rekicked here: https://review.rdoproject.org/r/c/testproject/+/44593

Revision history for this message
Jiri Podivin (jpodivin) wrote :

The most recent execution of the job in the test project failed as well, this time with a 'Gateway Timeout' during the 'Clean up legacy Cinder keystone catalog entries' task, but still during the overcloud deploy and still with an error raised from keystone code.

Since 'Clean up legacy Cinder keystone catalog entries' precedes 'Check Keystone public endpoint status', which was the point of the original failure, it's not possible to determine whether the original issue was resolved or merely obscured by the new one.

However, given that the two tasks follow each other in quick succession, and that the error is in both cases raised from within keystone code, it is not implausible that the issues are connected.

Trace:
------
2022-08-22 07:07:02 | 2022-08-22 07:07:02.718837 | fa163e0a-00e1-a192-ef98-0000000092a1 | FATAL | Clean up legacy Cinder keystone catalog entries | undercloud | item={'service_name': 'cinderv2', 'service_type': 'volumev2'} | error={"ansible_index_var": "cinder_api_service", "ansible_loop_var": "item", "changed": false, "cinder_api_service": 0, "item": {"service_name": "cinderv2", "service_type": "volumev2"}, "module_stderr": "Traceback (most recent call last):\n File \"<stdin>\", line 107, in <module>\n File \"<stdin>\", line 99, in _ansiballz_main\n File \"<stdin>\", line 47, in invoke_module\n File \"/usr/lib64/python3.9/runpy.py\", line 225, in run_module\n return _run_module_code(code, init_globals, run_name, mod_spec)\n File \"/usr/lib64/python3.9/runpy.py\", line 97, in _run_module_code\n _run_code(code, mod_globals, init_globals,\n File \"/usr/lib64/python3.9/runpy.py\", line 87, in _run_code\n exec(code, run_globals)\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 185, in <module>\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 181, in main\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/module_utils/openstack.py\", line 407, in __call__\n File \"/tmp/ansible_openstack.cloud.catalog_service_payload_775qz0dd/ansible_openstack.cloud.catalog_service_payload.zip/ansible_collections/openstack/cloud/plugins/modules/catalog_service.py\", line 140, in run\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 517, in search_services\n services = self.list_services()\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 492, in list_services\n if self._is_client_version('identity', 2):\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/openstackcloud.py\", line 460, in _is_client_version\n client = getattr(self, client_name)\n File \"/usr/lib/python3.9/site-packages/openstack/cloud/_identity.py\", line 31, in _identity_client\n self._raw_clients['identity'] = self._get_versioned_client(\n File \"/usr/lib/python3.9/sit...


Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit ):

The failure mentioned in comment 5 looks related to galera, which failed to start:

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-1/var/log/containers/haproxy/haproxy.log.txt.gz
~~~
Aug 22 07:07:02 overcloud-controller-1 haproxy[7]: 10.0.0.1:50862 [22/Aug/2022:07:05:32.534] keystone_public~ keystone_public_be/overcloud-controller-1.internalapi.localdomain 0/0/0/81746/81746 504 397 - - ---- 1/1/0/0/0 0/0 "POST /v3/auth/tokens HTTP/1.1"
...

Aug 22 07:08:33 overcloud-controller-1 haproxy[7]: 10.0.0.1:48868 [22/Aug/2022:07:07:03.548] keystone_public~ keystone_public_be/overcloud-controller-1.internalapi.localdomain 0/0/0/82856/82856 504 397 - - ---- 1/1/0/0/0 0/0 "POST /v3/auth/tokens HTTP/1.1"
~~~

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-1/var/log/containers/keystone/keystone.log.txt.gz
~~~
2022-08-22 07:05:32.556 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -1 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2022-08-22 07:05:42.573 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -2 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2022-08-22 07:05:55.739 27 WARNING oslo_db.sqlalchemy.engines [None req-da352218-69c2-4ddc-8c66-30d5ee601c84 - - - - - -] SQL connection failed. -3 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
...
~~~

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-0/var/log/extra/pcs.txt.gz
~~~
Failed Resource Actions:
  * galera 10s-interval monitor on galera-bundle-2 returned 'error' (local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:50 2022
  * galera 10s-interval monitor on galera-bundle-1 returned 'error' (local node <overcloud-controller-0> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:49 2022
  * galera promote on galera-bundle-0 returned 'error' (MySQL server failed to start (pid=2516) (rc=0), please check your installation) at Mon Aug 22 06:41:48 2022 after 33.062s
~~~
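
For anyone retracing this triage on a live controller, the galera state can be confirmed directly. The commands below are generic pcs/mariadb ones; the bundle container name is an assumption and should be taken from the actual 'podman ps' output:

~~~
# Pacemaker's view of the galera bundle and any failed resource actions
pcs status --full

# Galera's own view of cluster membership and primary-component status,
# run inside the bundle container (container name is an assumption)
podman exec galera-bundle-podman-0 \
    mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"
~~~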

Revision history for this message
Ronelle Landy (rlandy) wrote :

IIUC, this is still an OVB issue, but no longer a standalone problem?

Revision history for this message
Jiri Podivin (jpodivin) wrote :

Indeed. That has seemingly recovered.

Revision history for this message
Jiri Podivin (jpodivin) wrote :
Revision history for this message
Jiri Podivin (jpodivin) wrote :

Comparing the galera bundle logs from the failing job [0] with those from a successful run of the sister job [1] shows divergence fairly early, starting with the error:

2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]

followed by others.

Trace:
------
2022-08-22T06:40:59.070682554+00:00 stderr F (log_execute) info: executing - rsc:galera action:start call_id:12
2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ cat: /var/lib/mysql/grastate.dat: No such file or directory ]
2022-08-22T06:41:01.472239128+00:00 stderr F (log_op_output) info: galera_start_0[56] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:01.472239128+00:00 stderr F (log_finished) info: galera start (call 12, PID 56) exited with status 0 (execution time 2.401s)
2022-08-22T06:41:02.702300941+00:00 stderr F (log_op_output) info: galera_monitor_30000[621] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:06.535741387+00:00 stderr F (log_op_output) info: galera_monitor_20000[755] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:13.340246042+00:00 stderr F (cancel_recurring_action) info: Cancelling ocf operation galera_monitor_20000
2022-08-22T06:41:13.381264500+00:00 stderr F (cancel_recurring_action) info: Cancelling ocf operation galera_monitor_30000
2022-08-22T06:41:13.387622857+00:00 stderr F (log_execute) info: executing - rsc:galera action:promote call_id:176
2022-08-22T06:41:23.214099415+00:00 stderr F (log_op_output) info: galera_promote_0[898] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:23.214099415+00:00 stderr F (log_op_output) info: galera_promote_0[898] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
2022-08-22T06:41:23.214099415+00:00 stderr F (log_finished) info: galera promote (call 176, PID 898) exited with status 0 (execution time 9.827s)
2022-08-22T06:41:49.722447502+00:00 stderr F (log_op_output) info: galera_monitor_10000[1718] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:41:59.811386227+00:00 stderr F (log_op_output) info: galera_monitor_10000[1779] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:42:12.085343779+00:00 stderr F (log_op_output) info: galera_monitor_10000[1839] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in primary mode. Unknown state. ]
2022-08-22T06:42:22.216097374+00:00 stderr F (log_op_output) info: galera_monitor_10000[1899] error output [ ocf-exit-reason:local node <overcloud-controller-0> is started, but not in...

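The missing /var/lib/mysql/grastate.dat is notable: that file is where Galera records a node's last known cluster state, and its absence typically means the node has no saved state to recover from (for example a fresh or wiped datadir). For reference, on a healthy node the file looks roughly like this (UUID and seqno are placeholders):

~~~
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
~~~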

Revision history for this message
Luca Miccini (lmiccini2) wrote :

 * Container bundle set: galera-bundle [cluster.common.tag/mariadb:pcmklatest]:
    * galera-bundle-0 (ocf:heartbeat:galera): FAILED Promoted overcloud-controller-2 (blocked)
    * galera-bundle-1 (ocf:heartbeat:galera): Unpromoted overcloud-controller-0
    * galera-bundle-2 (ocf:heartbeat:galera): Unpromoted overcloud-controller-1

Failed Resource Actions:
  * galera 10s-interval monitor on galera-bundle-2 returned 'error' (local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:50 2022
  * galera 10s-interval monitor on galera-bundle-1 returned 'error' (local node <overcloud-controller-0> is started, but not in primary mode. Unknown state.) at Mon Aug 22 06:41:49 2022
  * galera promote on galera-bundle-0 returned 'error' (MySQL server failed to start (pid=2516) (rc=0), please check your installation) at Mon Aug 22 06:41:48 2022 after 33.062s

If we look at the controllers' journals, it looks like they are isolated from a network perspective for a few seconds (maybe it's a real network hiccup, or maybe just an oversubscribed underlying hypervisor):

https://logserver.rdoproject.org/93/44593/2/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/fea5cd4/logs/overcloud-controller-0/var/log/extra/journal.txt.gz

Aug 22 06:41:30 overcloud-controller-0 pacemaker-attrd[21566]: notice: Setting galera-no-grastate[overcloud-controller-1]: true -> (unset)
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-1.internalapi.localdomain is DOWN, reason: Layer4 timeout, check duration: 1001ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-2.internalapi.localdomain is DOWN, reason: Layer4 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 22 06:41:38 overcloud-controller-0 haproxy[30582]: backend mysql_be has no server available!
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [MAIN ] Corosync main process was not scheduled (@1661150499475) for 9392.5205 ms (threshold is 8520.0000 ms). Consider token timeout increase.
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] link: host: 3 link: 0 is down
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] link: host: 2 link: 0 is down
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 3 has no active links
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 22 06:41:39 overcloud-controller-0 corosync[21550]: [KNET ] host: host: 2 has no active links

Aug 22 06:41:40 overcloud-controller-0 corosync[21550]: [TOTEM ] Token has not been received in 7987 ms
Aug 22 06:41:40 overcloud-controller-0 haproxy[30582]: Backup Server mysql_be/overcloud-controller-1.internalapi.loca...
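
The 'Consider token timeout increase' hint refers to corosync's totem token timeout: the scheduling warning fires when the corosync process is not scheduled for a large fraction of that timeout, and the subsequent 'Token has not been received' and link-down messages are what knock galera out of primary mode. On a deployed controller the setting lives in /etc/corosync/corosync.conf, roughly as below (a sketch only; the value matches the fix discussed later in this bug):

~~~
totem {
    version: 2
    # token timeout in milliseconds; raising it makes the cluster more
    # tolerant of scheduling stalls on an oversubscribed hypervisor
    token: 30000
}
~~~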

Revision history for this message
Jiri Podivin (jpodivin) wrote :

I can confirm that this bug is also hitting the wallaby version of the job:
periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation

Logs:
-----
https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation/309130d/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

periodic-tripleo-ci-centos-9-standalone-validation-master is green now [1].

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-standalone-validation-master&skip=0

Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation is also green [2].

[2] https://review.rdoproject.org/zuul/build/125a8b5120b94f44aa092045bce575d1

Revision history for this message
Jiri Podivin (jpodivin) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation is still failing. The master job has passed once, so maybe it will get better too.

Revision history for this message
Jiri Podivin (jpodivin) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-wallaby-validation is now passing [3]. There is also a second execution of the master job [4].

[3] https://review.rdoproject.org/zuul/build/628974f5963e43729a3e94b45848c674
[4] https://review.rdoproject.org/zuul/build/1afd570571be4bc5b94d87d62106f822

Revision history for this message
Jiri Podivin (jpodivin) wrote :

There was one successful run of periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation on Saturday.

But since then there have been several failures [1]: two due to the 'extras-common' repo being unavailable, and one during the overcloud deploy at 'Discovering nova hosts' [2], with 'Lost connection to MySQL server during query' as the error.

The job is still unstable, and the controller-0 logs [3] contain galera errors, so it could be the same issue.

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation&project=openstack/tripleo-ci
[2] https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/acda03f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-component-validation/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-component-master-validation/acda03f/logs/overcloud-controller-0/var/log/extra/journal_errors.txt.gz

Revision history for this message
Jakob Meng (jm1337) wrote :

The same galera issue can be observed in our periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby job as well.

overcloud-controller-2/var/log/containers/stdouts/galera-bundle.log.txt.gz [1]:

  2022-08-30T02:23:26.011709616+00:00 stderr F (log_op_output) info: galera_monitor_10000[7104] error output [ ocf-exit-reason:local node <overcloud-controller-2> is started, but not in primary mode. Unknown state. ]

overcloud-controller-2/var/log/containers/mysql/mysqld.log.txt.gz [2]:

  2022-08-30 2:23:31 0 [Note] /usr/libexec/mariadbd (initiated by: unknown): Normal shutdown

In this case, the job failed in [3]:

  TASK [os_tempest : Executing python-tempestconf]

with a completely different error [4]:

  2022-08-30 02:26:02.485 364416 INFO tempest.lib.common.rest_client [req-ade2427a-1652-4a3c-a44a-a8f661753257 ] Request (main): 500 GET https://10.0.0.5:13000/v3/projects 0.028s
  2022-08-30 02:26:02.485 364416 CRITICAL tempest [-] Unhandled error: tempest.lib.exceptions.ServerFault: Got server fault

while Tempest tries to access Keystone, which fails because the DB connection is closed [5]:

  2022-08-30 02:25:06.493 180 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

[1] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/stdouts/galera-bundle.log.txt.gz

[2] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/mysql/mysqld.log.txt.gz

[3] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/job-output.txt

[4] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/undercloud/var/log/tempest/tempestconf.log.txt.gz

[5] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/e9a3692/logs/overcloud-controller-2/var/log/containers/keystone/keystone.log.txt.gz
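
The correlation here is purely by timestamp: the galera monitor error at 02:23:26, the mariadb shutdown at 02:23:31, and keystone's DB connection errors shortly after. A quick way to line these events up from the downloaded logs (file names as in the links above, after gunzip; grep patterns are illustrative):

~~~
grep 'not in primary mode' galera-bundle.log.txt
grep -E 'Normal shutdown|WSREP' mysqld.log.txt
grep 'DBConnectionError' keystone.log.txt
~~~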

Jiri Podivin (jpodivin)
summary: - periodic-tripleo-ci-centos-9-standalone-validation-master fails on
- 'Check Keystone public endpoint status'
+ Pacemaker performance causes intermittent galera issues
summary: - Pacemaker performance causes intermittent galera issues
+ Pacemaker performance causes intermittent galera issues in loaded CI env
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/858945

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :

periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby looks green now.

Revision history for this message
Jiri Podivin (jpodivin) wrote :

We had four green executions in a row, so I would call it pretty stable.

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :
Revision history for this message
Soniya Murlidhar Vyas (svyas) wrote :
Revision history for this message
Jakob Meng (jm1337) wrote :

Luca's patches [1] and [2] both work properly, but they have not been merged yet due to unrelated job failures. Ronelle's testproject [3] shows that both settings are taking effect [4],[5]. We do not yet know, though, whether this really solves our intermittent issues. We are rekicking another testproject [6] to see whether we can reproduce the galera error even with Luca's patches applied. Let's cross fingers that those galera issues are finally solved by Luca ☺️

[1] https://review.opendev.org/c/openstack/puppet-tripleo/+/859553
[2] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859568
[3] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712
[4] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712/comments/a41ba2b7_ed43e251
[5] https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712/comments/2be73f3a_6ad0cc15
[6] https://review.rdoproject.org/r/c/testproject/+/45225

Revision history for this message
Ananya Banerjee (frenzyfriday) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/859712
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/eb57ba758b69ca95650ea18d5846f28397411936
Submitter: "Zuul (22348)"
Branch: master

commit eb57ba758b69ca95650ea18d5846f28397411936
Author: Ronelle Landy <email address hidden>
Date: Wed Sep 28 15:25:14 2022 -0400

    Sets higher values for timeouts

     - corosync token_timeout: 30000
     - puppet pacemaker: evs.suspect_timeout=PT30S

    These higher values are an attempt to avoid
    slower operations hitting errors on overcloud
    deployments in OVB jobs.

    Related-Bug: #1987092
    Change-Id: Idd8170e435335566e9d1114675259d4d8d603364
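
For reference, evs.suspect_timeout is a Galera provider option: it controls how long a silent cluster member is tolerated before it is suspected dead, so raising it to 30 seconds keeps galera from evicting nodes during short scheduling stalls. On a deployed controller it ends up in the galera/mariadb configuration along these lines (a sketch only; file path and surrounding options omitted, the templating is handled by puppet-tripleo):

~~~
wsrep_provider_options = "evs.suspect_timeout=PT30S"
~~~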

Rabi Mishra (rabi)
tags: added: ovb
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/861846
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/b58b1d6e9c4c920052d96fedc261cac8155f1c04
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit b58b1d6e9c4c920052d96fedc261cac8155f1c04
Author: Ronelle Landy <email address hidden>
Date: Wed Sep 28 15:25:14 2022 -0400

    Sets higher values for timeouts

     - corosync token_timeout: 30000
     - puppet pacemaker: evs.suspect_timeout=PT30S

    These higher values are an attempt to avoid
    slower operations hitting errors on overcloud
    deployments in OVB jobs.

    Related-Bug: #1987092
    Change-Id: Idd8170e435335566e9d1114675259d4d8d603364
    (cherry picked from commit eb57ba758b69ca95650ea18d5846f28397411936)

tags: added: in-stable-wallaby
Ronelle Landy (rlandy)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "Ghanshyam <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/858945
Reason: TripleO project is retiring now, for details, please see https://review.opendev.org/c/openstack/governance/+/905145 or reach out to OpenStack TC.
