Openstack-Helm mariadb crash recovery failed

Bug #1922299 reported by Yvonne Ding
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: chen haochuan

Bug Description

Brief Description
-----------------
stx-openstack went to apply-failed after a host swact followed by a lock/unlock of the standby controller. MariaDB pods in the cluster should not be affected by a swact.

Severity
--------
Major

Steps to Reproduce
------------------
Swact the active controller
Lock the standby controller
Unlock the standby controller
Check application status with system application-list (a CLI sketch of these steps follows below)
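
A minimal CLI sketch of the steps above, assuming controller-0 is active and controller-1 is the standby (the hostnames are assumptions and depend on the lab):

    # Swact services from the active controller to the standby
    system host-swact controller-0
    # Lock, then unlock, the standby controller
    system host-lock controller-1
    system host-unlock controller-1
    # stx-openstack then shows apply-failed
    system application-list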

Expected Behavior
------------------
stx-openstack apply success

Actual Behavior
----------------
stx-openstack apply failed

Reproducibility
---------------
reproducible

System Configuration
--------------------
Standard 2+2

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="21.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20210401T032802Z"

Timestamp/Logs
--------------
nova scheduler
==============
2021-04-01T20:12:05.392844204Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines [-] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2021-04-01T20:12:05.392865121Z stdout F [SQL: SELECT 1]
2021-04-01T20:12:05.392870469Z stdout F (Background on this error at: http://sqlalche.me/e/e3q8)
2021-04-01T20:12:05.392874788Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2021-04-01T20:12:05.392879282Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
2021-04-01T20:12:05.39288348Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines cursor, statement, parameters, context
2021-04-01T20:12:05.392887573Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
2021-04-01T20:12:05.39289165Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2021-04-01T20:12:05.392895426Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 170, in execute
2021-04-01T20:12:05.392900839Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2021-04-01T20:12:05.392904817Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 328, in _query
2021-04-01T20:12:05.392908972Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2021-04-01T20:12:05.392912707Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 517, in query
2021-04-01T20:12:05.392916774Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2021-04-01T20:12:05.392920523Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 732, in _read_query_result
2021-04-01T20:12:05.39292429Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines result.read()
2021-04-01T20:12:05.392928166Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 1075, in read
2021-04-01T20:12:05.392931826Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2021-04-01T20:12:05.392935998Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 657, in _read_packet
2021-04-01T20:12:05.392939863Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2021-04-01T20:12:05.392943824Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 707, in _read_bytes
2021-04-01T20:12:05.392947824Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2021-04-01T20:12:05.392951943Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2021-04-01T20:12:05.392956015Z stdout F 2021-04-01 20:12:05.388 1 ERROR oslo_db.sqlalchemy.engines

......
2021-04-01T20:12:17.417007851Z stdout F 2021-04-01 20:12:17.415 1 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2021-04-01T20:12:17.417011342Z stdout F 2021-04-01 20:12:17.415 1 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2021-04-01T20:12:17.417025119Z stdout F 2021-04-01 20:12:17.415 1 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
2021-04-01T20:12:17.417046489Z stdout F 2021-04-01 20:12:17.415 1 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)
2021-04-01T20:12:17.417051859Z stdout F 2021-04-01 20:12:17.415 1 ERROR oslo_db.sqlalchemy.engines

Mariadb
=======
2021-04-01T20:10:48.639819377Z stderr F 2021-04-01 20:10:48,639 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:48 140439366817472 [Note] InnoDB: Initializing buffer pool, total size = 1G, instances = 8, chunk size = 128M'
2021-04-01T20:10:48.676400175Z stderr F 2021-04-01 20:10:48,676 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:48 140439366817472 [Note] InnoDB: Completed initialization of buffer pool'
2021-04-01T20:10:48.677173326Z stderr F 2021-04-01 20:10:48,677 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:48 140437742954240 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().'
2021-04-01T20:10:48.702813372Z stderr F 2021-04-01 20:10:48,702 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:48 140439366817472 [Note] InnoDB: Highest supported file format is Barracuda.'
2021-04-01T20:10:48.704745594Z stderr F 2021-04-01 20:10:48,704 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:48 140439366817472 [Note] InnoDB: Starting crash recovery from checkpoint LSN=47390445'
2021-04-01T20:10:49.077905158Z stderr F 2021-04-01 20:10:49,077 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] InnoDB: 128 out of 128 rollback segments are active.'
2021-04-01T20:10:49.079157287Z stderr F 2021-04-01 20:10:49,078 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"'
2021-04-01T20:10:49.079172298Z stderr F 2021-04-01 20:10:49,079 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] InnoDB: Creating shared tablespace for temporary tables'
2021-04-01T20:10:49.079198426Z stderr F 2021-04-01 20:10:49,079 - OpenStack-Helm Mariadb - INFO - b"2021-04-01 20:10:49 140439366817472 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ..."
2021-04-01T20:10:49.080041521Z stderr F 2021-04-01 20:10:49,079 - OpenStack-Helm Mariadb - INFO - b"2021-04-01 20:10:49 140439366817472 [Note] InnoDB: File './ibtmp1' size is now 12 MB."
2021-04-01T20:10:49.082732472Z stderr F 2021-04-01 20:10:49,082 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] InnoDB: Waiting for purge to start'
2021-04-01T20:10:49.132968943Z stderr F 2021-04-01 20:10:49,132 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] InnoDB: 5.7.29 started; log sequence number 47390454'
2021-04-01T20:10:49.133013416Z stderr F 2021-04-01 20:10:49,132 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140437167793920 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool'
2021-04-01T20:10:49.142892943Z stderr F 2021-04-01 20:10:49,142 - OpenStack-Helm Mariadb - INFO - b"2021-04-01 20:10:49 140439366817472 [Note] Plugin 'FEEDBACK' is disabled."
2021-04-01T20:10:49.142916321Z stderr F 2021-04-01 20:10:49,142 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [Note] Recovering after a crash using tc.log'
2021-04-01T20:10:49.143467469Z stderr F 2021-04-01 20:10:49,143 - OpenStack-Helm Mariadb - INFO - b'2021-04-01 20:10:49 140439366817472 [ERROR] Recovery failed! You must enable all engines that were enabled at the moment of the crash'
2021-04-01T20:10:51.281596579Z stderr F 2021-04-01 20:10:51,281 - OpenStack-Helm Mariadb - INFO - b''
2021-04-01T20:10:51.281617047Z stderr F 2021-04-01 20:10:51,281 - OpenStack-Helm Mariadb - INFO - b"2021-04-01 20:10:49 140439366817472 [ERROR] Crash recovery failed. Either correct the problem (if it's, for example, out of memory error) and restart, or delete tc log and start mysqld with --tc-heuristic-recover={commit|rollback}"
2021-04-01T20:10:51.281672087Z stderr F 2021-04-01 20:10:51,281 - OpenStack-Helm Mariadb - INFO - b''
2021-04-01T20:10:51.281699365Z stderr F 2021-04-01 20:10:51,281 - OpenStack-Helm Mariadb - INFO - b"2021-04-01 20:10:49 140439366817472 [ERROR] Can't init tc log"
2021-04-01T20:10:51.281897339Z stderr F 2021-04-01 20:10:51,281 - OpenStack-Helm Mariadb - INFO - Launching MariaDB
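
The tc.log hints in the log above can be followed up from the active controller; a minimal diagnostic sketch, assuming the openstack namespace and the usual mariadb-server-0 pod name (the /var/lib/mysql datadir comes from the log itself):

    # Pull the crash-recovery messages from the previously crashed container
    kubectl -n openstack logs mariadb-server-0 --previous | grep -i 'recovery\|tc.log'
    # Inspect the transaction coordinator log the server could not initialize
    kubectl -n openstack exec mariadb-server-0 -- ls -l /var/lib/mysql/tc.log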

Revision history for this message
Yvonne Ding (yding) wrote :
Revision history for this message
Yvonne Ding (yding) wrote :

This issue was introduced by either the "20210331T031417Z" load or the "20210401T032802Z" load. Please check the CHANGELOGs for these two loads.

Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Revision history for this message
Austin Sun (sunausti) wrote :

@martin: since you fixed a mariadb issue before, could you triage this issue first? Is it similar to the previous one?

Changed in starlingx:
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Please check with the latest build; the following patch was merged on 04/08:
https://review.opendev.org/c/starlingx/config/+/786092

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking this as High for stx.5.0 since the openstack application apply failure is reproducible. Based on the comment above, it appears that this may be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1917308

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.5.0
Revision history for this message
Austin Sun (sunausti) wrote :

The fix for https://bugs.launchpad.net/starlingx/+bug/1917308 has merged; once it is cherry-picked to stx.5.0, please check whether this issue is gone too. Thanks.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fix for LP: https://bugs.launchpad.net/starlingx/+bug/1917308 merged in r/stx.5.0 on 2021-04-27

@Austin, if your team confirmed that this is indeed a duplicate, please set the status to Fix Released and mark as a duplicate.

Revision history for this message
Austin Sun (sunausti) wrote :

Done.
Yvonne: would you check whether this issue is gone after the 4/27 stx.5.0 ISO? Thanks.

Revision history for this message
Yvonne Ding (yding) wrote :

stx-openstack applies successfully in the latest stx.5.0 load, but the nova scheduler and mariadb-server errors still exist.

OS="centos"
SW_VERSION="21.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.5.0"

JOB="STX_5.0_build_layer_flock"
<email address hidden>"
BUILD_NUMBER="13"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-04-28 23:03:21 +0000"

Revision history for this message
chen haochuan (martin1982) wrote :

Please upload the log files, collected with "collect --all".

Revision history for this message
Austin Sun (sunausti) wrote :

Removing the duplicate marking since this issue is still seen in the latest stx.5.0.

Revision history for this message
Yvonne Ding (yding) wrote :
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Yvonne:
   I tried on my side with the stx.5.0 0505 build; even though there are error logs in nova-scheduler and mariadb, the openstack functions are working fine.
   Would you check on your setup whether the openstack functions are still working after the errors appear in the logs (nova-scheduler and mariadb)?
   If the functions are working, then I don't think this is a real issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
chen haochuan (martin1982) wrote :

I fixed this issue. Even without the fix, when one controller is locked, nova-scheduler and some other pods only print the error logs shown in the issue description; the openstack services still work fine.

Revision history for this message
Yvonne Ding (yding) wrote :

Ran the openstack regular sanity with the stx/5.0 0510 build on a storage 2+4+2 lab; sanity is green.

Overall Status: GREEN
Lab: IP_20-27
Node Config: 2+4+2
Load: r/stx.5.0
Job: STX_5.0_build_layer_flock

OS="centos"
SW_VERSION="21.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.5.0"

JOB="STX_5.0_build_layer_flock"
<email address hidden>"
BUILD_NUMBER="25"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-05-10 15:54:33 +0000"

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks, Yvonne, for the update.

Revision history for this message
Bill Zvonar (billzvonar) wrote :

We agreed to downgrade this to Medium, and will not hold up the stx.5.0 release for it.

Changed in starlingx:
importance: High → Medium
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on the stx release meeting, moving to the next release since this will not hold up stx.5.0

tags: added: stx.6.0
removed: stx.5.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/790321
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/b7308b6107924b5e168885589cd9b2f6a0d21a0d
Submitter: "Zuul (22348)"
Branch: master

commit b7308b6107924b5e168885589cd9b2f6a0d21a0d
Author: Chen, Haochuan Z <email address hidden>
Date: Sat May 8 14:46:17 2021 +0800

    Fix oslo_db config mismatch error

    Since OpenStack Queens, oslo_db changed the default idle time to 3600s:
    https://github.com/openstack/oslo.db/blob/c49c96ff74190b9f1b1504eaa8a7ef66b8ae333f/oslo_db/options.py#L52
    This makes the oslo_db timeout setting mismatch with mysqld's setting
    in mariadb. Update the setting in mariadb.

    The two openstack issues below report a similar problem; see the
    discussion and fix proposals in the issue comments:
    https://bugs.launchpad.net/oslo.db/+bug/1774544
    https://bugs.launchpad.net/openstack-ansible/+bug/1787517

    Closes-Bug: 1922299

    Change-Id: I8e41884f460f45ace7b7da8df87867ec32626c98
    Signed-off-by: Chen, Haochuan Z <email address hidden>
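
The mismatch described in the commit above can be sanity-checked on a running system. A minimal sketch, assuming the openstack namespace, the usual mariadb-server-0 pod name, and placeholder names for the root password and nova-scheduler pod; the MariaDB wait_timeout should be at least as large as oslo_db's connection_recycle_time (3600s by default since Queens):

    # Server-side idle timeout inside the MariaDB pod (pod name and credentials assumed)
    kubectl -n openstack exec mariadb-server-0 -- \
        mysql -uroot -p<root-password> -e "SHOW GLOBAL VARIABLES LIKE 'wait_timeout';"
    # Client-side recycle interval used by oslo_db in a nova pod (pod name assumed)
    kubectl -n openstack exec <nova-scheduler-pod> -- \
        grep connection_recycle_time /etc/nova/nova.conf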

Changed in starlingx:
status: In Progress → Fix Released