Stale data in INFORMATION_SCHEMA.INNODB_LOCKS

Bug #677407 reported by Kristian Nielsen
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MariaDB
In Progress
High
Kristian Nielsen
Percona Server moved to https://jira.percona.com/projects/PS
Fix Released
High
Alexey Kopytov

Bug Description

This is MySQL Bug#48883 (http://bugs.mysql.com/bug.php?id=48883).

The XtraDB INFORMATION_SCHEMA.INNODB_PLUGIN table is populated in a two-step
procedure. First, a table cache is filled with the real information, under the
kernel_mutex. This is only done if at least 100 milliseconds has passed since
last filling the cache. Second, the MySQL table data is populated from the
cache, which may thus be up to 100 milliseconds old.

There is a bug in the accounting for how often to do the first step. The
timestamp for last update is updated in step 2, not in step 1. This has two
consequences:

1. If the INNODB_PLUGIN table is queried very frequently repeatedly (with less
than 100 milliseconds interval in-between), stale data can be returned
indefinitely, as the "last updated" timestamp is reset even though step 1 is
not run. This is the problem seen in Bug#48883, and in MariaDB Buildbot.

2. If multiple queries against INNODB_PLUGIN start at around the same time,
they can _all_ cause the cache to be updated, as the "last updated" timestamp
is not reset immediately after updating the cache. This can cause extra
unnecessary contention on kernel_mutex.

The failure is easy to repeat by changing the sleep in
mysql-test/include/wait_condition.inc from 0.1 to 0.01 seconds, and running
the innodb_information_schema.test.

I will attach a patch that fixes it.

Related branches

Revision history for this message
Kristian Nielsen (knielsen) wrote :

Patch for the problem, including BSD license for Percona for inclusion into Percona-Server.

Percona developers, please check the patch and consider it for inclusion (or let me
know if you see any problem with the patch).

I have committed the fix to XtraDB in MariaDB.

Changed in maria:
status: New → Fix Committed
assignee: nobody → Kristian Nielsen (knielsen)
importance: Undecided → Medium
importance: Medium → High
Changed in percona-server:
assignee: nobody → Alexey Kopytov (akopytov)
Percona (percona-team)
Changed in percona-server:
status: New → Confirmed
importance: Undecided → High
Percona (percona-team)
Changed in percona-server:
milestone: none → 5.1.53-11.7
Changed in percona-server:
status: Confirmed → In Progress
Revision history for this message
Alexey Kopytov (akopytov) wrote :

Pushed to release-5.1.53-11 and trunk.

Changed in percona-server:
status: In Progress → Fix Committed
Changed in maria:
milestone: none → 5.1
Revision history for this message
Vasil Dimov (vasild) wrote :

Hello Kristian,

I am the author of the 3 InnoDB INFORMATION_SCHEMA tables INNODB_TRX, INNODB_LOCKS and INNODB_LOCK_WAITS. I am afraid your analysis is wrong and the patch breaks the logic that prevents inconsistent data being fed to MySQL during JOIN.

First you are talking about the INNODB_PLUGIN table. I assume this is a typo and you actually mean "any of INNODB_TRX, INNODB_LOCKS or INNODB_LOCK_WAITS" instead.

You say "First, a table cache is filled with the real information, under the
kernel_mutex. This is only done if at least 100 milliseconds has passed since
last filling the cache." this is not true, it should read "if at least 100 milliseconds have passed since last READING the cache."

"Second, the MySQL table data is populated from the cache, which may thus be up to 100 milliseconds old", this is also not true. The cache may be older than 100ms.

See http://dev.mysql.com/doc/refman/5.5/en/innodb-information-schema-notes.html

Let me outline the problem that this cache is trying to solve. When the user issues SELECT from any of those I_S tables then MySQL asks the storage engine "fill this table with data for me". When the user issues a JOIN on two or more tables then MySQL will ask the storage engine to fill the tables independently of each other. That is - the storage engine does not know it is a JOIN.

This combined with the fact that the contents of each of those 3 tables could change very fast means that it is possible to feed inconsistent contents of the tables to MySQL when a JOIN is executed. This is highly undesirable because the JOIN output will then be malformed.

The solution implemented is not to _update_ the cache if it has been _read_ recently.

If you query any of those I_S tables more often than once per 0.1 seconds you will see data from the past. But if you really do this on a production server (why?) then that server will be stalled so much that it could become unusable.

"2. If multiple queries against INNODB_PLUGIN start at around the same time,
they can _all_ cause the cache to be updated, as the "last updated" timestamp
is not reset immediately after updating the cache. This can cause extra
unnecessary contention on kernel_mutex."

It is "last READ" timestamp, not "last updated". This is just fine and is per designed. From the doc:
"Because InnoDB must temporarily stall while the transaction and locking data is collected, too frequent queries of these tables can negatively impact performance as seen by other users"

Can you come up with an example where your patch breaks the above logic and leads to inconsistent data (from different cache snapshots) being fed to MySQL resulting in bogus JOIN result?

Thanks!

Revision history for this message
Kristian Nielsen (knielsen) wrote : Re: [Bug 677407] Re: Stale data in INFORMATION_SCHEMA.INNODB_LOCKS

Vasil Dimov <email address hidden> writes:

> Hello Kristian,

Hi Vasil, thanks for commenting on this issue!

> I am the author of the 3 InnoDB INFORMATION_SCHEMA tables INNODB_TRX,
> INNODB_LOCKS and INNODB_LOCK_WAITS. I am afraid your analysis is wrong
> and the patch breaks the logic that prevents inconsistent data being fed
> to MySQL during JOIN.

Yes, you are right, this is indeed documented behaviour:

    http://dev.mysql.com/doc/refman/5.5/en/innodb-information-schema-notes.html

So clearly my patch breaks this :-(

So I think I will need to revert the patch, thanks for pointing out the
problem.

When I first read your explanation, I thought it was hardly any better to be
consistent within 100 milliseconds, but inconsistent for larger intervals.
Eg. if someone joined two INFORMATION_SCHEMA.INNODB_XXX tables with one large
table in the middle, presumably there could still be inconsistencies.

But clearly, there is a common use-case, which is a monitoring tool doing
basic joins against just these INNODB_XXX tables, and in practise the 100
milliseconds work well I suppose. And it is documented behaviour (shame on me
for missing that). And a full transactional snapshot isolation implementation
for INNODB_XXX does seem a bit overkill.

----

So, the original problem here is this MySQL bug:

    http://bugs.mysql.com/bug.php?id=48883

The test case innodb.innodb_information_schema fails sporadically because of
this behaviour. I will find a better way to fix this failure.

> Can you come up with an example where your patch breaks the above logic
> and leads to inconsistent data (from different cache snapshots) being
> fed to MySQL resulting in bogus JOIN result?

Yes, I will try to put together a test case that fails with my patch. Though
it may be tricky due to the 100 milliseconds, a busy test server might get
inconsistent join anyway.

Thanks,

 - Kristian.

Revision history for this message
Vasil Dimov (vasild) wrote :

Hi,

Yes, the "write if only it has not been read for some time" solution is not 100% guarantee, but is a good enough approach.

Maybe the solution to the failing innodb_information_schema test is to inject "-- sleep 0.1" just before calling wait_until_rows_count.inc? Or actually the whole wait_until_rows_count.inc is useless because as soon as it starts it will query the I_S table very often and will thus "freeze" the cache. But that will also be non-deterministic.

Thanks!

Changed in maria:
status: Fix Committed → In Progress
Revision history for this message
Vasil Dimov (vasild) wrote :
Revision history for this message
Kristian Nielsen (knielsen) wrote :

Couldn't we make a similar solution without a magic "100msec" timeout?

For example, have something similar to a refcount on the cache, which counts the number of active statements that accessed the tables. The refcount would be incremented on first use of one of the tables in a statement, and decremented at statement end. And we would only refresh the cache if the refcount was zero. Or something like that, what do you think?

Revision history for this message
Vasil Dimov (vasild) wrote :

Hi,

I need to look at the API again, but I think it is very limited - MySQL just calls a "fill" function for a given I_S table. I was thinking about finding out the SQL statement string and parsing it to see if it is a JOIN but then decided this would be an infinite source of bugs and abandoned the idea :)

Changed in percona-server:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-444

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.