long semaphore wait after upgrade and enabling GTID replication

Bug #1391660 reported by shinguz
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

After upgrading from Percona Server 5.5.38 to 5.6.21 and enabling GTID replication, we ran into long semaphore waits after a while.
After restarting the database and disabling GTID replication, the problem disappeared for a while. But some hours later it occurred again, starting with a "max connections reached" error and long semaphore waits.

See shortened mysql error log.

We are waiting for the next occurrence and are considering downgrading again.

Revision history for this message
shinguz (oli-sennhauser) wrote :
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Thank you for the problem report. Please share your my.cnf file content or the output of SHOW GLOBAL VARIABLES from this instance.

I see the following in the error log:

Main thread process no. 11074, id 139789062067968, state: enforcing dict cache limit

So the main thread is calling srv_master_evict_from_table_cache() and thus also dict_mutex_enter_for_mysql().

Long waits (on lock_wait_mutex_enter()) happen in lock_wait_suspend_thread() and lock_wait_table_release_slot(). There must be some other thread that holds that mutex. The only other place where it is taken, as far as I can see, is in the

os_thread_ret_t
DECLARE_THREAD(lock_wait_timeout_thread)()

function.
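To illustrate the kind of stall described above with a minimal sketch (plain Python threads standing in for mysqld threads; the lock here is an ordinary mutex, not InnoDB's actual lock_wait_mutex):

```python
# Minimal illustration of the contention pattern: one thread holds a mutex
# through a long critical section, so any thread that then tries to take it
# (the equivalent of lock_wait_mutex_enter()) stalls for the whole duration.
import threading
import time

lock_wait_mutex = threading.Lock()   # stand-in name, not the real InnoDB object

def holder():
    with lock_wait_mutex:
        time.sleep(1.0)              # long critical section, e.g. a slow scan

t = threading.Thread(target=holder)
t.start()
time.sleep(0.1)                      # let the holder grab the mutex first

start = time.monotonic()
with lock_wait_mutex:                # the waiter blocks here until release
    waited = time.monotonic() - start
t.join()
print(f"waited {waited:.1f}s for the mutex")
```

In the real server the holder would be whichever thread took the mutex last, and the semaphore-wait diagnostics are InnoDB noticing that a waiter has been stuck this way for too long.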

So I think you may have hit some memory limit for the number of transactions and locks used. See also these upstream reports for any similarity to your use case:

http://bugs.mysql.com/bug.php?id=71511
http://bugs.mysql.com/bug.php?id=73322

Both were reported for 5.6 and have unclear status.

If this long wait happens again, please get the output of pt-pmp, so that we have more information about the call sequences in the threads. As a workaround I suggest

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Jörg Brühe (joerg-bruehe) wrote :

This is the requested "/etc/my.cnf" file.

Changed in percona-server:
status: Incomplete → New
Revision history for this message
Rick Pizzi (pizzi) wrote :

I am not using GTIDs, but this bug seems to have bitten me right away.
The entire InnoDB subsystem locked up, with the main thread as the blocker (holding a lock) in state "enforcing dict cache limit".
After 600 seconds InnoDB gave up with an assertion, and the server was restarted by mysqld_safe.

Looking at the error log, what puzzles me is the following (139627810412288 is the main thread, the blocker):

--Thread 139627810412288 has waited at srv0srv.cc line 2596 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x134c460 '&dict_operation_lock'
a writer (thread id 139627810412288) has reserved it in mode wait exclusive
number of readers 7, waiters flag 1, lock_word: fffffffffffffff9
Last time read locked in file row0ins.cc line 1803
Last time write locked in file /mnt/workspace/percona-server-5.6-redhat-binary/label_exp/centos6-64/rpmbuild/BUILD/percona-server-5.6.25-73.1/storage/innobase/dict/dict0stats.cc line 2385

So, from the above it seems that the blocker was blocked by itself. This seems quite odd!!

Any thoughts?

Running Percona Server 5.6.25-73.1-log.

Thanks
Rick
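As an aside on the "blocked by itself" puzzle above: one plausible reading of the report (inferred from the log text, not from the InnoDB source) is that lock_word is a signed counter printed as unsigned, so a small negative value means the writer has reserved the latch in wait_ex mode and is waiting for the remaining readers to drain, rather than waiting on itself:

```python
# Hypothetical decoding of the rw-latch state from the semaphore report above.
# Assumption (inferred from the log, not verified against InnoDB source):
# lock_word is a signed counter dumped as an unsigned 64-bit hex value; a
# small negative value means a writer holds the latch in wait_ex mode while
# |lock_word| readers still hold shared locks on it.
lock_word_hex = 0xFFFFFFFFFFFFFFF9          # "lock_word: fffffffffffffff9"
lock_word = lock_word_hex - (1 << 64)       # reinterpret as signed 64-bit
print(lock_word)                            # -7
pending_readers = -lock_word
print(pending_readers)                      # 7, matches "number of readers 7"
```

Under that reading, the writer thread is listed as the reserver of the latch, but what it is actually waiting on are the 7 readers (one of which last read-locked in row0ins.cc line 1803), not itself.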

Revision history for this message
Rick Pizzi (pizzi) wrote :

Adding the output of the InnoDB status (as appended to the error log) and my.cnf.

Revision history for this message
Rick Pizzi (pizzi) wrote :

At the time the problem occurred, an online schema change was running and about to complete.
The table being altered was a very large (1 billion rows) partitioned one; it is likely that the OSC tool was about to execute, or was executing, the atomic rename of tables. This is just speculation, though, as the operation never completed and is therefore not logged in the binary logs or anywhere else.

Revision history for this message
Rick Pizzi (pizzi) wrote :

Filed a separate bug about this issue; please see https://bugs.launchpad.net/percona-server/+bug/1500176

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to https://jira.percona.com/browse/PS-3231
