MySQL deadlock issues with simultaneous users

Bug #1514608 reported by Aaron Wells on 2015-11-09
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mahara
High
Aaron Wells
1.10
High
Aaron Wells
15.04
High
Aaron Wells
15.10
High
Aaron Wells

Bug Description

The patch in Bug 1492919 resolved the deadlock issues for some users, but others are still reporting that they are getting deadlocks. This seems to be an issue unique to MySQL, and it can happen with as few as 2 or 3 concurrent users.

It has been reported across all stable versions. Strangely, it seems to be a relatively recent issue (i.e. we never heard this with Mahara 1.8 or 1.7, etc), so this suggests it was triggered by something that changed fairly recently. One possible candidate is our switch to the mysqli driver (before that Mahara wasn't doing transactions at all, in mysql).

See also:
* https://mahara.org/interaction/forum/topic.php?id=7309
* https://mahara.org/interaction/forum/topic.php?id=7432

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Most of our reports are for sites on 15.04, and it seems to get worse when there are more users. It's possible this was some sort of underlying deadlock potential in Mahara that was made worse by some low-level code change in 15.04, like something that gets triggered by init.php or some other frequent bit of code, which was not triggered as frequently before.

description: updated
Changed in mahara:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Aaron Wells (u-aaronw)
Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

I have seen the error from one system where a deadlock occurred even with the transaction isolation change. That error was in a different part of the code and related to the triggers for quota checks.

This had been where I had started to look again at the transaction behaviour in general with Mahara and mysql. Unfortunately I've not managed to replicate the deadlock error on a system I have access to.

Does somebody have a reproducible case?

Aaron Wells (u-aaronw) on 2015-11-09
Changed in mahara:
milestone: none → 15.04.5
Aaron Wells (u-aaronw) on 2015-11-10
description: updated
description: updated
Revision history for this message
Aaron Wells (u-aaronw) wrote :

It sounds like multiple people copying a collection at the same time, is the most consistent cause of the problem.

Revision history for this message
Aaron Wells (u-aaronw) wrote :
Download full text (5.0 KiB)

Okay, I was able to replicate this locally by hand, by having multiple users copy the same collection at the same time, and adding a command of "sleep(1)" into the execute_sql() function in lib/dml.php to increase the amount by which the queries overlap.

1. Set up a Mahara site with five users.
2. Have one of the users create a collection with several pages and artefacts.
3. Set the collection so that it is shared with all registered users, and it can be copied, and the owner retains view access to copies.
4. In four browser simultaneously:
4.1. Log in as one of the other four users.
4.2. View the copyable collection
4.3. Click the "copy" button (but don't yet click "Collection" in the confirmation popup)
5. Edit lib/dml.php, adding "sleep(1);" on line 138, inside of the execute_sql() function
6. In each of the four browsers in quick succession, click the "Collection" button.

Expected result: It should copy the collection successfully in all four sessions.
Actual result: Some of the sessions get a deadlock error message. In my case, two of them did, and they both deadlocked while trying to insert into the "view_rows_columns" table.

[WAR] d9 (lib/errors.php:747) Failed to get a recordset: mysqli error: [1213: Deadlock found when trying to get lock; try restarting transaction] in adodb_throw(INSERT INTO "view_rows_columns" ("view", "row", "columns") VALUES (64, 1, 3), )
Command was: INSERT INTO "view_rows_columns" ("view", "row", "columns") VALUES (?, ?, ?) and values was (64,1,3)
Call stack (most recent first):
log_message("Failed to get a recordset: mysqli error: [1213: De...", 8, true, true) at /home/aaronw/www/mahara/htdocs/lib/errors.php:97
log_warn("Failed to get a recordset: mysqli error: [1213: De...") at /home/aaronw/www/mahara/htdocs/lib/errors.php:747
SQLException->__construct("Failed to get a recordset: mysqli error: [1213: De...") at /home/aaronw/www/mahara/htdocs/lib/dml.php:1055
insert_record("view_rows_columns", object(stdClass)) at /home/aaronw/www/mahara/htdocs/lib/view.php:756
View->commit() at /home/aaronw/www/mahara/htdocs/lib/view.php:581
View::_create(array(size 5), "4") at /home/aaronw/www/mahara/htdocs/lib/view.php:377
View::create_from_template(array(size 5), "12", "4", true, false) at /home/aaronw/www/mahara/htdocs/lib/collection.php:286
Collection::create_from_template(array(size 3), 1) at /home/aaronw/www/mahara/htdocs/lib/view.php:6141
copyview("6", 0, null, 1) at /home/aaronw/www/mahara/htdocs/view/copy.php:32
[WAR] d9 (lib/errors.php:747) Failed to get a recordset: mysqli error: [1452: Cannot add or update a child row: a foreign key constraint fails ("mahara"."view_rows_columns", CONSTRAINT "viewrowscolu_vie_fk" FOREIGN KEY ("view") REFERENCES "view" ("id"))] in adodb_throw(INSERT INTO "view_rows_columns" ("view", "row", "columns") VALUES (64, 1, 3), )
Command was: INSERT INTO "view_rows_columns" ("view", "row", "columns") VALUES (?, ?, ?) and values was (64,1,3)
Call stack (most recent first):
log_message("Failed to get a recordset: mysqli error: [1452: Ca...", 8, true, true) at /home/aaronw/www/mahara/htdocs/lib/errors.php:97
log_warn("Failed to get a recordset: mysqli error: [1452: Ca...") at /home/aaronw/w...

Read more...

Revision history for this message
Kristina Hoeppner (kris-hoeppner) wrote :

Aaron, does this bug also need to be targeted against 15.10 and master?

Revision history for this message
Aaron Wells (u-aaronw) wrote :

I suppose it could be. I don't know if we've had any reports of it in 15.10 and master yet. Once I get my test automated into JMeter I can start checking for sure which versions are affected.

Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

I've attached a copy of the JMeter test plan I've used to test copying collections. You'll need a file specifying some users to make use of it (see 'Mahara users CSV' element in the test plan).

Adding a sleep to execute_sql() didn't make any difference from me (either with jmeter or a 4 user browser test) but I probably have a much smaller test db than you which might be making enough of a difference.

Did the deadlock messages from 'show engine innodb status' give anything useful?

Looking at the page/collection page logic, one thing it does is delete all rows in view_rows_columns before reinserting for a given view. If nothing exists in view_rows_columns for that view, the deletes would cause gap locks at the end of table. Which could then cause the following inserts to view_rows_columns to fail in the deadlock state we've seen.

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Thanks Russell. I'm completely new to JMeter, so having that test plan to start from should save me a lot of time!

Unfortunately I didn't think to check "show engine innodb status" after running that deadlock (because I figure I could tell from the PHP errors which query it had died on and that was enough). And since then I have dropped that database. I'll make sure to check the next time the problem happens, though. Or perhaps I'll turn on that innodb_print_all_deadlocks setting it mentions in the MySQL manual.

From reading what the MySQL manual has to say about deadlocks, it seems like our underlying problem in Mahara is that we're doing the entire collection copy operation inside of a transaction. And that copy operation may take a while, depending on how many pages and how much content is in the collection, the speed of the DB connection, and whether it requires copying files in dataroot as well. Just having a long-running transaction makes it a lot more likely that we'll hit deadlock problems.

Although it's strange to me, because I have not heard of any deadlock problems with Mahara in Postgres. Instead, the site tends to just slow down under heavy usage. Although perhaps that's Postgres being willing to wait longer for the deadlock to clear?

Once I'm able to replicate this with JMeter I'll explore some possible solutions. The most "brute-force" one would be to disable transactions entirely (maybe with a config-defaults.php option), if this actually is a problem caused by long-running transactions.

Revision history for this message
Aaron Wells (u-aaronw) wrote :

(For those following along, here's the MySQL manual page about deadlocks which I was referring to in the previous comment: http://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks.html )

Mahara's sanity check forces you to use InnoDB if you're on MySQL.

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Okay, I have been able to consistently replicate the error in JMeter, after a little tweaking of Russell's JMeter test plan. With a one-second sleep in place in the ADODB "Execute" method, I get deadlocks with as few as 4 concurrent users.

Without the delay, I consistently get a deadlock if I do tests of 100 concurrent users.

To make the results easier to scan in JMeter I've also patched my local Mahara instance so that it returns a 403 error when it displays the transient login page, and a 500 error when there's an unhandled exception or error. To do this, put "http_response_code(403);" in the top of the auth_draw_login_page() method in htdocs/auth/lib.php, and put "http_response_code(500);" inside the error() and exception() methods in htdocs/lib/errors.php. Without that, JMeter shows them all as successes, because there's a 200 response. (I guess I could also probably use a regex against the response data, but I found this to be easier.)

If I disabled transactions (by putting "return;" as the first line of db_begin(), db_commit(), and db_rollback() in htdocs/lib/dml.php), then it finished successfully with the 100 concurrent users. So, that's sort of the "Cutting the Gordean knot" solution to this problem.

I tried the same test case in Mahara 1.8, which was the first version of Mahara to use the "mysqli" extension. Prior to that, Mahara didn't do transactions at all in mysql because it was using the older "mysql" extension. In 1.8 I also got deadlocks, although it took longer for them to start.

I also tried running 100 concurrent users through the same test plan in a Postgres Mahara. Some users did error out due to a strange error inserting into the view_tags table, but they did not deadlock. So perhaps if we can tweak InnoDB to act more like Postgresql in regards to locking, that may be a solution, at least for those who can tweak their server settings.

Tomorrow I'll take a look at the actual places we're deadlocking in more fine grain and see if I can do anything about those.

Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

Great, pleased that was helpful. The JMeter plan is fairly unsophisticated, a better plan would parse the response and adjust accordingly. In the mean time your response code approach is a good suggestion.

Out of interest, is it easier to trigger the deadlock problems if you drop the transaction isolation level back down to REPEATABLE READ?

Deadlock info from "show engine innodb status" would be very helpful in identifying the root cause of the deadlocks. Even if the decision is to go transaction-free for MySQL....it might be useful further down the line.

When I looked at this before I don't think I quite appreciated how much other stuff is happening between the db_begin() and db_commit(). So moving file operations out of this would be good anyway.

> Although it's strange to me, because I have not heard of any deadlock problems with Mahara in Postgres.
> Instead, the site tends to just slow down under heavy usage. Although perhaps that's Postgres being
> willing to wait longer for the deadlock to clear?

Maybe but PostgreSQL and MySQL have different strategies for dealing with transactions and overlapping updates. The other thing I've just spotted is that slightly different SQL is being used on the two platforms. There are some 'FOR UPDATE' clauses which are only applied in the case of PostgreSQL (in lib/dml.php) which might have some impact. Will investigate...

And while I think of it, MySQL 5.7 has a number of changes that might be important later. See the details around "index lock contention" on http://mysqlserverteam.com/whats-new-in-mysql-5-7-so-far/

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Okay, after further testing today I've confirmed that disabling Mahara's usage of explicit database transactions (i.e. "start transaction;" and "commit;") in MySQL resolves this problem. On the test server I was using, I was getting deadlocks with as few as 4 concurrent users copying a collection. After disabling transactions, I was able to run 50 concurrent users without any errors.

At that point it was taking 8.5 minutes for the copy operation to finish, so I didn't test any higher than that. So, we still have some performance & scaling work we could do, but it solves the problem for deadlocks. :)

With further testing on the same server I also verified that the earlier patch changing the transaction isolation level from "REPEATABLE READ" to "READ COMMITTED" helps prevent deadlocks as well. When I disabled explicit transactions, and also rolled back to the default "REPEATABLE READ", I got deadlocks when I had 15 concurrent users copying the collection.

So I think Mahara's immediate fix for the next minor release, has go to to be:

1. Keeping the "READ COMMITTED" patch in place.
2. Disabling Mahara's usage of explicit transactions, in MySQL.

In the longer term we can look into all the places we use transactions, and maybe remove some of them. We shouldn't be putting huge operations, like the copying of a collection, inside of a database transaction. At least, not for MySQL, unless we can figure out how to make InnoDB act a lot more like Postgres in this regard.

Cheers,
Aaron

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Also, here's the innodb engine status from when I experienced a deadlock on the test server, before I disabled explicit transactions.

Revision history for this message
Aaron Wells (u-aaronw) wrote :

And here's the innodb engine status from after I disabled explicit transactions, and I set the transaction isolation level back to "REPEATABLE READ".

Revision history for this message
Aaron Wells (u-aaronw) wrote :

And here's an updated version of that JMeter test plan. It turns out there were some bugs in the "copy collection" step, which would cause it to error out if the id of the collection you were copying was not "1". I hadn't noticed this before because I was running it against a newly installed Mahara, where my test collection was indeed collection 1.

Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

Yes, maybe disabling transactions for MySQL is the way to go for the moment. At least until the longer operations currently in one transactions can be looked at in more detail. Other than copy page/collection, what other operations might be long running? I'd be happy to build some JMeter test plans as it's likely they'll be useful.

Are those innodb engine status messages from the system where you encountered the deadlock? Neither actually show a deadlock error in the database, just the list of currently running transactions. I'd expect a section in there called 'LATEST DETECTED DEADLOCK' (it will also give a timestamp as the first line).

Revision history for this message
Aaron Wells (u-aaronw) wrote :

I did a search through the Mahara code base for "db_start()", and found over 200 places where we use it. Some of these are reasonable, a few SQL queries together in the same function. Others are quite wide-ranging, like the collection copy. Or they do things like trigger events before calling db_commit(), which, depending on the registered event handlers, can lead to a lot of code getting executed.

That's interesting, those deadlock log files are definitely from the DB that I was running Mahara against (I made direct DB updates and saw them reflected in the web application). And the copy task did error out, with the errors coming from the mysqli driver saying "deadlock": [1213: Deadlock found when trying to get lock; try restarting transaction]

But perhaps mysqli means something different by "deadlock" than the MySQL InnoDB engine does? Or maybe there's some kind of logging setting in this DB that's making it not show up?

Revision history for this message
Mahara Bot (dev-mahara) wrote : A patch has been submitted for review

Patch for "master" branch: https://reviews.mahara.org/5723

Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

Hmmm...."Error 1213: Deadlock found when trying to get lock" is from MySQL rather than from mysqli. So the error message has come from the db at some point but that seems at odds with the lack of a latest deadlock section in 'show engine innodb status'. I'm not aware of an option that if enabled would not report deadlocks in this way.

In your test cases, if you enable 'innodb_print_all_deadlocks' option do you see a deadlock reported in the mysql error log? (With isolation level at read commited, I'm afraid I can't still trigger a deadlock for copy pages/collections)

Revision history for this message
Aaron Wells (u-aaronw) wrote :

Hi Russel,

Here is the collection that I'm copying in my tests. (It's a random assortment of the pages in my mahara.org account). Maybe testing with this will help? Or maybe your computer is just too good? ;) On my local machine it took 100 concurrent users copying the collection before I had issues, whereas on the client test server I did further testing on, it only took 10 concurrent users. So, the environment details certainly seem to make a difference.

Cheers,
Aaron

Revision history for this message
Mahara Bot (dev-mahara) wrote : A change has been merged

Reviewed: https://reviews.mahara.org/5723
Committed: https://git.mahara.org/mahara/mahara/commit/e52be8a3135c19b2ed306a5e16fdc0101d756f6d
Submitter: Aaron Wells (<email address hidden>)
Branch: master

commit e52be8a3135c19b2ed306a5e16fdc0101d756f6d
Author: Aaron Wells <email address hidden>
Date: Tue Nov 17 11:20:54 2015 +1300

Disable transactions in MySQL (Bug 1514608)

Mahara's usage of transactions has been causing
major deadlock issues in MySQL with InnoDB.
The immediate fix is to disable them.

behatnotneeded: Covered by existing tests

Change-Id: I2c7e6424e2d58502acd26fdd5d05c797e4ee5eac

Revision history for this message
Mahara Bot (dev-mahara) wrote : A patch has been submitted for review

Patch for "15.10_STABLE" branch: https://reviews.mahara.org/5774

Revision history for this message
Mahara Bot (dev-mahara) wrote :

Patch for "15.04_STABLE" branch: https://reviews.mahara.org/5775

Revision history for this message
Mahara Bot (dev-mahara) wrote :

Patch for "1.10_STABLE" branch: https://reviews.mahara.org/5776

Revision history for this message
Mahara Bot (dev-mahara) wrote : A change has been merged

Reviewed: https://reviews.mahara.org/5776
Committed: https://git.mahara.org/mahara/mahara/commit/2279fc840134700f16f29f3eaa30b5a1198a95e6
Submitter: Aaron Wells (<email address hidden>)
Branch: 1.10_STABLE

commit 2279fc840134700f16f29f3eaa30b5a1198a95e6
Author: Aaron Wells <email address hidden>
Date: Tue Nov 17 11:20:54 2015 +1300

Disable transactions in MySQL (Bug 1514608)

Mahara's usage of transactions has been causing
major deadlock issues in MySQL with InnoDB.
The immediate fix is to disable them.

behatnotneeded: Covered by existing tests

Change-Id: I2c7e6424e2d58502acd26fdd5d05c797e4ee5eac
(cherry picked from commit e52be8a3135c19b2ed306a5e16fdc0101d756f6d)

Revision history for this message
Russell Boyatt (russell-boyatt) wrote :

Aaron,

Thanks very much for posting the copy of your test collection. Unfortunately I've still been unable to replicate the problem - and my test computer is certainly far from 'too good' :) So there must be some difference in my setup. I'll try to come back to in a week or so and have a fresh look to see if I can work out what environment details are making the difference. Longer term it would be good to understand what's going on. Until then disabling the transactions (in mysql) should do the trick!

Russ

Revision history for this message
Mahara Bot (dev-mahara) wrote :

Reviewed: https://reviews.mahara.org/5774
Committed: https://git.mahara.org/mahara/mahara/commit/393fd87f5561a59b466468b303b0ee5587a9cef9
Submitter: Aaron Wells (<email address hidden>)
Branch: 15.10_STABLE

commit 393fd87f5561a59b466468b303b0ee5587a9cef9
Author: Aaron Wells <email address hidden>
Date: Tue Nov 17 11:20:54 2015 +1300

Disable transactions in MySQL (Bug 1514608)

Mahara's usage of transactions has been causing
major deadlock issues in MySQL with InnoDB.
The immediate fix is to disable them.

behatnotneeded: Covered by existing tests

Change-Id: I2c7e6424e2d58502acd26fdd5d05c797e4ee5eac
(cherry picked from commit e52be8a3135c19b2ed306a5e16fdc0101d756f6d)

Revision history for this message
Mahara Bot (dev-mahara) wrote :

Reviewed: https://reviews.mahara.org/5775
Committed: https://git.mahara.org/mahara/mahara/commit/6ded76ccdd402cd2bf55f465f92a76a8e8add43a
Submitter: Robert Lyon (<email address hidden>)
Branch: 15.04_STABLE

commit 6ded76ccdd402cd2bf55f465f92a76a8e8add43a
Author: Aaron Wells <email address hidden>
Date: Tue Nov 17 11:20:54 2015 +1300

Disable transactions in MySQL (Bug 1514608)

Mahara's usage of transactions has been causing
major deadlock issues in MySQL with InnoDB.
The immediate fix is to disable them.

behatnotneeded: Covered by existing tests

Change-Id: I2c7e6424e2d58502acd26fdd5d05c797e4ee5eac
(cherry picked from commit e52be8a3135c19b2ed306a5e16fdc0101d756f6d)

no longer affects: mahara/16.04
Changed in mahara:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers