long semaphore wait after upgrade and enabling GTID replication

Bug #1391660 reported by shinguz
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

After upgrading from Percona Server 5.5.38 to 5.6.21 and enabling GTID replication, we ran into long semaphore waits after a while.
After restarting the database and disabling GTID replication, the problem disappeared for a while. But some hours later it occurred again, starting with a "max connections reached" error and long semaphore waits.

See shortened mysql error log.

We are waiting for the next occurrence and are considering downgrading again.

Revision history for this message
shinguz (oli-sennhauser) wrote :
Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Thank you for the problem report. Please share your my.cnf file content or the output of SHOW GLOBAL VARIABLES from this instance.

I see the following in the error log:

Main thread process no. 11074, id 139789062067968, state: enforcing dict cache limit

So the main thread is calling srv_master_evict_from_table_cache() and thus also dict_mutex_enter_for_mysql().

Long waits (on lock_wait_mutex_enter()) happen in lock_wait_suspend_thread() and lock_wait_table_release_slot(). There must be some other thread that holds that mutex. The only other place where it is taken, as far as I can see, is in the

os_thread_ret_t
DECLARE_THREAD(lock_wait_timeout_thread)()

function.
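To illustrate the kind of stall described above with a minimal sketch (plain Python threads standing in for mysqld threads; the lock here is an ordinary mutex, not InnoDB's actual lock_wait_mutex):

```python
# Minimal illustration of the contention pattern: one thread holds a mutex
# through a long critical section, so any thread that then tries to take it
# (the equivalent of lock_wait_mutex_enter()) stalls for the whole duration.
import threading
import time

lock_wait_mutex = threading.Lock()   # stand-in name, not the real InnoDB object

def holder():
    with lock_wait_mutex:
        time.sleep(1.0)              # long critical section, e.g. a slow scan

t = threading.Thread(target=holder)
t.start()
time.sleep(0.1)                      # let the holder grab the mutex first

start = time.monotonic()
with lock_wait_mutex:                # the waiter blocks here until release
    waited = time.monotonic() - start
t.join()
print(f"waited {waited:.1f}s for the mutex")
```

In the real server the holder would be whichever thread took the mutex last, and the semaphore-wait diagnostics are InnoDB noticing that a waiter has been stuck this way for too long.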

So I think you may have hit some memory limit for the number of transactions and locks used. See also these upstream reports for any similarity to your use case:

http://bugs.mysql.com/bug.php?id=71511
http://bugs.mysql.com/bug.php?id=73322

Both were reported for 5.6 and have unclear status.

If this long wait happens again, please get the output of pt-pmp, so that we have more information about the call sequences in the threads. As a workaround I suggest

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Jörg Brühe (joerg-bruehe) wrote :

This is the requested "/etc/my.cnf" file.

Changed in percona-server:
status: Incomplete → New
Revision history for this message
Rick Pizzi (pizzi) wrote :

I am not using GTIDs, but this bug seems to have bitten me right away.
The entire InnoDB subsystem locked up, with the main thread as the blocker (holding a lock) in state "enforcing dict cache limit".
After 600 seconds InnoDB gave up with an assertion, and the server was restarted by mysqld_safe.

Looking at the error log, what puzzles me is the following (139627810412288 is the main thread, the blocker):

--Thread 139627810412288 has waited at srv0srv.cc line 2596 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x134c460 '&dict_operation_lock'
a writer (thread id 139627810412288) has reserved it in mode wait exclusive
number of readers 7, waiters flag 1, lock_word: fffffffffffffff9
Last time read locked in file row0ins.cc line 1803
Last time write locked in file /mnt/workspace/percona-server-5.6-redhat-binary/label_exp/centos6-64/rpmbuild/BUILD/percona-server-5.6.25-73.1/storage/innobase/dict/dict0stats.cc line 2385

So, from the above it seems that the blocker was blocked by itself. This seems quite odd!!

Any thoughts?

Running Percona Server 5.6.25-73.1-log.

Thanks
Rick
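As an aside on the "blocked by itself" puzzle above: one plausible reading of the report (inferred from the log text, not from the InnoDB source) is that lock_word is a signed counter printed as unsigned, so a small negative value means the writer has reserved the latch in wait_ex mode and is waiting for the remaining readers to drain, rather than waiting on itself:

```python
# Hypothetical decoding of the rw-latch state from the semaphore report above.
# Assumption (inferred from the log, not verified against InnoDB source):
# lock_word is a signed counter dumped as an unsigned 64-bit hex value; a
# small negative value means a writer holds the latch in wait_ex mode while
# |lock_word| readers still hold shared locks on it.
lock_word_hex = 0xFFFFFFFFFFFFFFF9          # "lock_word: fffffffffffffff9"
lock_word = lock_word_hex - (1 << 64)       # reinterpret as signed 64-bit
print(lock_word)                            # -7
pending_readers = -lock_word
print(pending_readers)                      # 7, matches "number of readers 7"
```

Under that reading, the writer thread is listed as the reserver of the latch, but what it is actually waiting on are the 7 readers (one of which last read-locked in row0ins.cc line 1803), not itself.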

Revision history for this message
Rick Pizzi (pizzi) wrote :

Adding the output of the InnoDB status (as appended to the error log) and my.cnf.

Revision history for this message
Rick Pizzi (pizzi) wrote :

At the time the problem occurred, an online schema change was running and about to complete.
The table being altered was a very large (1 billion rows) partitioned one; it is likely that the OSC tool was about to execute, or was executing, the atomic rename of tables. This is just speculation, though, as the operation never completed and is therefore not logged in the binary logs or anywhere else.

Revision history for this message
Rick Pizzi (pizzi) wrote :

Filed a separate bug about this issue; please see https://bugs.launchpad.net/percona-server/+bug/1500176

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to https://jira.percona.com/browse/PS-3231
