kill_long_selects fails on slow Jenkins slaves

Bug #1214274 reported by Alexey Kopytov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraBackup moved to https://jira.percona.com/projects/PXB
Fix Released
Medium
Sergei Glushchenko
2.1
Fix Released
Medium
Sergei Glushchenko
2.2
Fix Released
Medium
Sergei Glushchenko

Related branches

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Sergei, can you take a look?

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

- case 1 failure. We forked select and update queries, and run xtrabackup.
  xtrabackup started before forked processes started.
  * possible solution is to introduce synchronization in test case between
    forked queries and main execution thread
 * more simple solution is to increase timeouts in xtrabackup parameters to
   very large value
- case 1 failure. Mysql server has gone away query killing has been finished.
  Looks like xtrabackup killed it's own connection.
  * possible solution is to check if connection belongs to xtrabackup and don't
    kill it.

Both failures are case 1. However 2 may happen not only for case 1.
I propose to proceed with simple solution for case 1. Case 2 shouldn't be hard to implement either.

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Sergei,

I have to admit I don't understand anything from the above:

- what select and update queries are "forked" by xtrabackup?
- why does xtrabackup starting before forked processes are forked lead to a test failure? Which from those 2 failures does this case correspond to?
- what is case #2 (the comment mentions case 1 twice)
- how can case #2 only happen for case #1?
- how can xtrabackup "kill" its own connection? innobackupex may kill processes (but if it killed itself, it would fail differently?), and xtrabackup doesn't "own" any connections and doesn't kill anything.

Can you clarify?

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Hi Alexey,

> I have to admit I don't understand anything from the above:

I was not clear enough.

> - what select and update queries are "forked" by xtrabackup?

Not by xtrabackup, but by test case. Test case executes selects/updates in background and xtrabackup should kill or not to kill them depending on arguments specified.

> - why does xtrabackup starting before forked processes are forked lead to a test failure? Which from those 2 failures does this case correspond to?

Xtrabackup may start before forked (by test) queries started justs because nothing prevents it.

> - what is case #2 (the comment mentions case 1 twice)
> - how can case #2 only happen for case #1?

I mixed case 1 (which is under ====== case 1 ====== section of test) and two cases of failure :)
Lets refer the one with innobackupex started early as "a)" - it is the "ubuntu-raring-64bit".
"b)" - centos5-32.

a) tried to test following
   - run update query which lasts 3 seconds
   - run select query witch lasts 3 seconds
   - run innobackupex which will kill all queries 5 seconds after FTWRL
   - as a result both queries and xtrabackup should succeed
   This is failed because innobackupex started to work earlier that queries in background and killed one of them.

> - how can xtrabackup "kill" its own connection? innobackupex may kill processes (but if it killed itself, it would fail differently?), and xtrabackup doesn't "own" any connections and doesn't kill anything.

I referred xtrabackup as product. innobackupex killed it's own connection, and the error is
innobackupex: Error:
Error executing 'SHOW MASTER STATUS': DBD::mysql::db selectrow_hashref failed: MySQL server has gone away

> Can you clarify?

I hope now things become little bit more clear.

Revision history for this message
Alexey Kopytov (akopytov) wrote : Re: [Bug 1214274] Re: kill_long_selects fails on slow Jenkins slaves

Hi Sergei,

Thanks, it is makes more sense now.

On Wed, 21 Aug 2013 14:16:41 -0000, Sergei Glushchenko wrote:
> Hi Alexey,
>
>> I have to admit I don't understand anything from the above:
>
> I was not clear enough.
>
>> - what select and update queries are "forked" by xtrabackup?
>
> Not by xtrabackup, but by test case. Test case executes selects/updates
> in background and xtrabackup should kill or not to kill them depending
> on arguments specified.
>
>> - why does xtrabackup starting before forked processes are forked lead
> to a test failure? Which from those 2 failures does this case correspond
> to?
>
> Xtrabackup may start before forked (by test) queries started justs
> because nothing prevents it.
>
>> - what is case #2 (the comment mentions case 1 twice)
>> - how can case #2 only happen for case #1?
>
> I mixed case 1 (which is under ====== case 1 ====== section of test) and two cases of failure :)
> Lets refer the one with innobackupex started early as "a)" - it is the "ubuntu-raring-64bit".
> "b)" - centos5-32.
>
> a) tried to test following
> - run update query which lasts 3 seconds
> - run select query witch lasts 3 seconds
> - run innobackupex which will kill all queries 5 seconds after FTWRL
> - as a result both queries and xtrabackup should succeed
> This is failed because innobackupex started to work earlier that queries in background and killed one of them.
>

OK, so it is a pure test case problem. I don't see a way to implement
the necessary level of synchronization to make the test fully
deterministic. So let's just increase innobackupex timeouts.

>> - how can xtrabackup "kill" its own connection? innobackupex may kill
> processes (but if it killed itself, it would fail differently?), and
> xtrabackup doesn't "own" any connections and doesn't kill anything.
>
> I referred xtrabackup as product. innobackupex killed it's own connection, and the error is
> innobackupex: Error:
> Error executing 'SHOW MASTER STATUS': DBD::mysql::db selectrow_hashref failed: MySQL server has gone away
>

This one looks suspicious. Any ideas how could that happen? I.e. how
could the innobackupex connection pass the is_query() /
is_select_query() checks in kill_long_queries()?

Besides I don't see any "Killing query ..." messages in the log. Is
there any chance they are not actually printed when a query is killed?

Anyway, this is a functionality issue and should be reported separately.
Can you analyze and report it?

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Here's another failure with different symptoms:

http://jenkins.percona.com/view/XtraBackup/job/percona-xtrabackup-2.1/152/BUILD_TYPE=debug,Host=ubuntu-lucid-32bit,xtrabackuptarget=xtradb56/testReport/junit/%28root%29/t_kill_long_selects/sh/

2013-08-21 23:46:24: run.sh: ===> /usr/bin/mysql --defaults-file=/home/jenkins/workspace/percona-xtrabackup-2.1/BUILD_TYPE/debug/Host/ubuntu-lucid-32bit/xtrabackuptarget/xtradb56/test/var/w1/var1/my.cnf test
ERROR 1317 (70100) at line 1 in file: '/home/jenkins/workspace/percona-xtrabackup-2.1/BUILD_TYPE/debug/Host/ubuntu-lucid-32bit/xtrabackuptarget/xtradb56/test/var/w1/var1/tmp/killall.sql': Query execution was interrupted
2013-08-21 23:46:29: run.sh: ===> mysql failed with exit code 1

I guess it's also a test case issue, so it can be fixed together with case a) from above comments.

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Yes, the last one is the test case issue.

Regarding the innobackupex issue, I think I found what caused it. FTWRL was acquired immediately and main thread sent HUP signal to child process which responsible for killing queries to stop it. Signal was sent before child set up it's signal handler and set InactiveDestroy property for it's copy of parent's database connection handler.

Revision history for this message
Alexey Kopytov (akopytov) wrote :

Sergei, can you report it as a separate bug? Can we fix it with AutoInactiveDestroy?

You can also initialize 'end' to 1, but then reset it to 0 just before the 'while' loop. I'm not sure if that will remove the race between signal handler setup code in the child process and the SIGHUP sending in the parent completely, but should certainly reduce its probability.

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-659

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.