kill_long_selects fails on slow Jenkins slaves

Reported by Alexey Kopytov on 2013-08-20
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraBackup
Status tracked in 2.2
2.1
Medium
Sergei Glushchenko
2.2
Medium
Sergei Glushchenko
Alexey Kopytov (akopytov) wrote :

Sergei, can you take a look?

- case 1 failure. We forked select and update queries, and run xtrabackup.
  xtrabackup started before forked processes started.
  * possible solution is to introduce synchronization in test case between
    forked queries and main execution thread
 * more simple solution is to increase timeouts in xtrabackup parameters to
   very large value
- case 1 failure. Mysql server has gone away query killing has been finished.
  Looks like xtrabackup killed it's own connection.
  * possible solution is to check if connection belongs to xtrabackup and don't
    kill it.

Both failures are case 1. However 2 may happen not only for case 1.
I propose to proceed with simple solution for case 1. Case 2 shouldn't be hard to implement either.

Alexey Kopytov (akopytov) wrote :

Sergei,

I have to admit I don't understand anything from the above:

- what select and update queries are "forked" by xtrabackup?
- why does xtrabackup starting before forked processes are forked lead to a test failure? Which from those 2 failures does this case correspond to?
- what is case #2 (the comment mentions case 1 twice)
- how can case #2 only happen for case #1?
- how can xtrabackup "kill" its own connection? innobackupex may kill processes (but if it killed itself, it would fail differently?), and xtrabackup doesn't "own" any connections and doesn't kill anything.

Can you clarify?

Hi Alexey,

> I have to admit I don't understand anything from the above:

I was not clear enough.

> - what select and update queries are "forked" by xtrabackup?

Not by xtrabackup, but by test case. Test case executes selects/updates in background and xtrabackup should kill or not to kill them depending on arguments specified.

> - why does xtrabackup starting before forked processes are forked lead to a test failure? Which from those 2 failures does this case correspond to?

Xtrabackup may start before forked (by test) queries started justs because nothing prevents it.

> - what is case #2 (the comment mentions case 1 twice)
> - how can case #2 only happen for case #1?

I mixed case 1 (which is under ====== case 1 ====== section of test) and two cases of failure :)
Lets refer the one with innobackupex started early as "a)" - it is the "ubuntu-raring-64bit".
"b)" - centos5-32.

a) tried to test following
   - run update query which lasts 3 seconds
   - run select query witch lasts 3 seconds
   - run innobackupex which will kill all queries 5 seconds after FTWRL
   - as a result both queries and xtrabackup should succeed
   This is failed because innobackupex started to work earlier that queries in background and killed one of them.

> - how can xtrabackup "kill" its own connection? innobackupex may kill processes (but if it killed itself, it would fail differently?), and xtrabackup doesn't "own" any connections and doesn't kill anything.

I referred xtrabackup as product. innobackupex killed it's own connection, and the error is
innobackupex: Error:
Error executing 'SHOW MASTER STATUS': DBD::mysql::db selectrow_hashref failed: MySQL server has gone away

> Can you clarify?

I hope now things become little bit more clear.

Hi Sergei,

Thanks, it is makes more sense now.

On Wed, 21 Aug 2013 14:16:41 -0000, Sergei Glushchenko wrote:
> Hi Alexey,
>
>> I have to admit I don't understand anything from the above:
>
> I was not clear enough.
>
>> - what select and update queries are "forked" by xtrabackup?
>
> Not by xtrabackup, but by test case. Test case executes selects/updates
> in background and xtrabackup should kill or not to kill them depending
> on arguments specified.
>
>> - why does xtrabackup starting before forked processes are forked lead
> to a test failure? Which from those 2 failures does this case correspond
> to?
>
> Xtrabackup may start before forked (by test) queries started justs
> because nothing prevents it.
>
>> - what is case #2 (the comment mentions case 1 twice)
>> - how can case #2 only happen for case #1?
>
> I mixed case 1 (which is under ====== case 1 ====== section of test) and two cases of failure :)
> Lets refer the one with innobackupex started early as "a)" - it is the "ubuntu-raring-64bit".
> "b)" - centos5-32.
>
> a) tried to test following
> - run update query which lasts 3 seconds
> - run select query witch lasts 3 seconds
> - run innobackupex which will kill all queries 5 seconds after FTWRL
> - as a result both queries and xtrabackup should succeed
> This is failed because innobackupex started to work earlier that queries in background and killed one of them.
>

OK, so it is a pure test case problem. I don't see a way to implement
the necessary level of synchronization to make the test fully
deterministic. So let's just increase innobackupex timeouts.

>> - how can xtrabackup "kill" its own connection? innobackupex may kill
> processes (but if it killed itself, it would fail differently?), and
> xtrabackup doesn't "own" any connections and doesn't kill anything.
>
> I referred xtrabackup as product. innobackupex killed it's own connection, and the error is
> innobackupex: Error:
> Error executing 'SHOW MASTER STATUS': DBD::mysql::db selectrow_hashref failed: MySQL server has gone away
>

This one looks suspicious. Any ideas how could that happen? I.e. how
could the innobackupex connection pass the is_query() /
is_select_query() checks in kill_long_queries()?

Besides I don't see any "Killing query ..." messages in the log. Is
there any chance they are not actually printed when a query is killed?

Anyway, this is a functionality issue and should be reported separately.
Can you analyze and report it?

Alexey Kopytov (akopytov) wrote :

Here's another failure with different symptoms:

http://jenkins.percona.com/view/XtraBackup/job/percona-xtrabackup-2.1/152/BUILD_TYPE=debug,Host=ubuntu-lucid-32bit,xtrabackuptarget=xtradb56/testReport/junit/%28root%29/t_kill_long_selects/sh/

2013-08-21 23:46:24: run.sh: ===> /usr/bin/mysql --defaults-file=/home/jenkins/workspace/percona-xtrabackup-2.1/BUILD_TYPE/debug/Host/ubuntu-lucid-32bit/xtrabackuptarget/xtradb56/test/var/w1/var1/my.cnf test
ERROR 1317 (70100) at line 1 in file: '/home/jenkins/workspace/percona-xtrabackup-2.1/BUILD_TYPE/debug/Host/ubuntu-lucid-32bit/xtrabackuptarget/xtradb56/test/var/w1/var1/tmp/killall.sql': Query execution was interrupted
2013-08-21 23:46:29: run.sh: ===> mysql failed with exit code 1

I guess it's also a test case issue, so it can be fixed together with case a) from above comments.

Yes, the last one is the test case issue.

Regarding the innobackupex issue, I think I found what caused it. FTWRL was acquired immediately and main thread sent HUP signal to child process which responsible for killing queries to stop it. Signal was sent before child set up it's signal handler and set InactiveDestroy property for it's copy of parent's database connection handler.

Alexey Kopytov (akopytov) wrote :

Sergei, can you report it as a separate bug? Can we fix it with AutoInactiveDestroy?

You can also initialize 'end' to 1, but then reset it to 0 just before the 'while' loop. I'm not sure if that will remove the race between signal handler setup code in the child process and the SIGHUP sending in the parent completely, but should certainly reduce its probability.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers