wsrep_slave_threads >1 causes foreign key constraint violations

Bug #1823850 reported by Trent Lloyd on 2019-04-09
This bug affects 2 people
Affects: percona-xtradb-cluster-5.6 (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

When running OpenStack against Percona XtraDB 5.6 (Xenial) we observe foreign key violation crashes on the slave servers when using wsrep_slave_threads > 1. This happens semi-regularly every few days in at least 1 production environment.

Unfortunately it is not straightforward to reproduce in a test environment, and reproducing the race likely requires a fairly performant one. It was hit many times in production, particularly when deploying large heat templates against an OpenStack cloud, but we do not currently have a test case or even precise OpenStack steps to reproduce the issue. I suggest we try using rally or otherwise concoct some heat templates to reproduce it.

The servers in question were also running on HDD storage.

I did find the following similar bug though without a lot of detail:
https://jira.mariadb.org/browse/MDEV-13246

2018-10-08 02:26:11 283550 [ERROR] Slave SQL: Could not execute Delete_rows event on table heat.raw_template; Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`stack`, CONSTRAINT `stack_ibfk_2` FOREIGN KEY (`prev_raw_template_id`) REFERENCES `raw_template` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 12099, Error_code: 1451
2018-10-08 02:26:11 283550 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 90837000
2018-10-08 02:26:11 283550 [Warning] WSREP: Failed to apply app buffer: seqno: 90837000, status: 1
         at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
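
To help count how often this occurs on a node, the log lines above can be matched with a short script. This is only a minimal sketch: the regular expressions are assumptions derived from the three sample lines in this report (not from any documented log format), and the function name find_fk_apply_failures is hypothetical. It pairs each foreign key apply error (MySQL error codes 1451/1452) with the seqno from the following "Failed to apply app buffer" warning.

```python
import re

# Sample lines copied from this bug report; real logs may differ in format.
SAMPLE_LOG = """\
2018-10-08 02:26:11 283550 [ERROR] Slave SQL: Could not execute Delete_rows event on table heat.raw_template; Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`stack`, CONSTRAINT `stack_ibfk_2` FOREIGN KEY (`prev_raw_template_id`) REFERENCES `raw_template` (`id`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event's master log FIRST, end_log_pos 12099, Error_code: 1451
2018-10-08 02:26:11 283550 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 90837000
2018-10-08 02:26:11 283550 [Warning] WSREP: Failed to apply app buffer: seqno: 90837000, status: 1
"""

# Assumed pattern for the applier error line (based on the sample above).
FK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ \[ERROR\] Slave SQL: "
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+);"
    r".*?Error_code: (?P<code>\d+)"
)
# Assumed pattern for the Galera apply-failure warning.
APPLY_RE = re.compile(r"WSREP: Failed to apply app buffer: seqno: (?P<seqno>\d+)")

# 1451 = parent row is referenced, 1452 = referenced parent row is missing.
FK_ERROR_CODES = {"1451", "1452"}


def find_fk_apply_failures(log_text):
    """Return one dict per foreign key apply error found in log_text."""
    failures = []
    pending = None  # most recent FK error still waiting for its seqno
    for line in log_text.splitlines():
        m = FK_RE.match(line)
        if m and m.group("code") in FK_ERROR_CODES:
            pending = {
                "ts": m.group("ts"),
                "event": m.group("event"),
                "table": m.group("table"),
                "code": int(m.group("code")),
                "seqno": None,
            }
            failures.append(pending)
            continue
        m = APPLY_RE.search(line)
        if m and pending is not None and pending["seqno"] is None:
            pending["seqno"] = int(m.group("seqno"))
    return failures


if __name__ == "__main__":
    for failure in find_fk_apply_failures(SAMPLE_LOG):
        print(failure)
```

In practice the contents of the node's MySQL error log would be passed in place of SAMPLE_LOG; the log path depends on the deployment and is not stated in this report.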

Trent Lloyd (lathiat) on 2019-04-09
Changed in percona-xtradb-cluster-5.6 (Ubuntu):
status: New → Confirmed
Mario Splivalo (mariosplivalo) wrote :

Hi, Trent.

Is this bug related:

https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1692745

I remember hitting this earlier on 5.6 and was not able to reproduce it on 5.7; it should have been fixed since then.

It especially happens when there is network latency between PXC units. Running specific rally tests yielded the issue usually somewhere between 20 minutes and 3 hours into the test.

But, as seen above, that issue has been fixed. I'll try to re-do the testing and confirm that.

Trent Lloyd (lathiat) wrote :

Mario advised me that he tested the latest upstream Percona XtraDB Cluster 5.6 release (5.6.43-28.32) against his reproducer (using OpenStack Rally), which causes foreign key violations even with wsrep_slave_threads=1, and that the issue does not reproduce on that release.

There is no clear entry in the XtraDB Cluster changelog itself detailing this (I had already checked, and have now double-checked). However, what is not obvious is that when XtraDB Cluster is updated, it also updates several other components, each with its own changelog:
 - Percona Server (5.6.43-84.3)
 - Codership WSREP API (5.6.42)
 - Codership Galera (3.25)

So to find the specific bug fix we would need to analyse all 3 changelogs.

I will look at whether that is feasible, or whether we can justify a microversion update exception, since that has already been done at least once.

Note that some components are in a separate source package, i.e. percona-galera

Mario Splivalo (mariosplivalo) wrote :

Hi, all.

So, just to leave my comment here:

5.6.37-26.21-0ubuntu0.16.04.2 <- xenial, ubuntu archives - this version fails rally tests, that is, bug LP: #1692745 still applies to xenial.

5.6.43-28.32 <- this version has LP: #1692745 fixed. I am currently testing to see if wsrep_slave_threads can be safely set to more than 1 on that version too. I'll update this bug with more info as soon as I have it.
