pt-mysql-summary does not work with MariaDB 10.2.5

Bug #1680803 reported by Sveta Smirnova on 2017-04-07
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
New
Undecided
Unassigned

Bug Description

pt-mysql-summary hangs at the Processlist summary stage when connected to MariaDB 10.2.5:

sveta@Thinkie:~/src/percona-toolkit$ pt-mysql-summary --socket=/home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/mysql-test/var/tmp/mysqld.1.sock --user=root
# Percona Toolkit MySQL Summary Report #######################
              System time | 2017-04-07 10:33:54 UTC (local TZ: +03 +0300)
# Instances ##################################################
  Port Data Directory Nice OOM Socket
  ===== ========================== ==== === ======
  16000 /home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/mysql-test/var/mysqld.1/data 0 0 /home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/mysql-test/var/tmp/mysqld.1.sock
   3306 /var/lib/mysql 0 0 /var/run/mysqld/mysqld.sock
   3371 /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.16-rel10-27.19.1.Linux.x86_64.ssl100/data1 0 0 /tmp/pxc1.sock
   3373 /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.16-rel10-27.19.1.Linux.x86_64.ssl100/data3 0 0 /tmp/pxc3.sock
   3374 /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.16-rel10-27.19.1.Linux.x86_64.ssl100/data4 0 0 /tmp/pxc4.sock
   3372 /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.16-rel10-27.19.1.Linux.x86_64.ssl100/data2 0 0 /tmp/pxc2.sock
# MySQL Executable ###########################################
       Path to executable | ./bin/mysqld
              Has symbols | Yes
       Path to executable | /home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/bin/mysqld
              Has symbols | Yes
       Path to executable | /usr/sbin/mysqld
              Has symbols | Yes
# Report On Port 16000 #######################################
                     User | root@localhost
                     Time | 2017-04-07 13:33:54 (+03)
                 Hostname | Thinkie
                  Version | 10.2.5-MariaDB-log MariaDB Server
                 Built On | Linux x86_64
                  Started | 2017-04-07 13:31 (up 0+00:01:56)
                Databases | 5
                  Datadir | /home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/mysql-test/var/mysqld.1/data/
                Processes | 1 connected, 1 running
              Replication | Is not a slave, has 0 slaves connected
                  Pidfile | /home/sveta/mysql_packages/mariadb-10.2.5-linux-x86_64/mysql-test/var/run/mysqld.1.pid (exists)
# Processlist ################################################

  Command COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

^C2017_04_07_13_50_17 Caught signal, forcing exit

Tested with versions 2.2.20 and 3.0.2

tags: added: pt120
Download full text (8.1 KiB)

I see no problem like that with 10.2.6 (on Fedora 25) or 10.2.5 (on Ubuntu 14.04) built from source:

openxs@ao756:~/dbs/maria10.2$ pt-mysql-summary --version
pt-mysql-summary 3.0.2
openxs@ao756:~/dbs/maria10.2$ pt-mysql-summary --socket=/tmp/mariadb.sock
# Percona Toolkit MySQL Summary Report #######################
              System time | 2017-04-18 09:58:52 UTC (local TZ: EEST +0300)
# Instances ##################################################
  Port Data Directory Nice OOM Socket
  ===== ========================== ==== === ======
        /var/lib/mysql 0 0
   3307 /home/openxs/dbs/maria10.2/data 0 0 /tmp/mariadb.sock
# MySQL Executable ###########################################
       Path to executable | /home/openxs/dbs/maria10.2/bin/mysqld
              Has symbols | Yes
       Path to executable | /usr/sbin/mysqld
              Has symbols | No
# Report On Port 3307 ########################################
                     User | @localhost
                     Time | 2017-04-18 12:58:52 (EEST)
                 Hostname | ao756
                  Version | 10.2.5-MariaDB Source distribution
                 Built On | Linux x86_64
                  Started | 2017-04-18 12:57 (up 0+00:02:21)
                Databases | 2
                  Datadir | /home/openxs/dbs/maria10.2/data/
                Processes | 1 connected, 1 running
              Replication | Is not a slave, has 0 slaves connected
                  Pidfile | /home/openxs/dbs/maria10.2/data/ao756.pid (exists)
# Processlist ################################################

  Command COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

  User COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

  Host COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

  db COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

  State COUNT(*) Working SUM(Time) MAX(Time)
  ------------------------------ -------- ------- --------- ---------

# Status Counters (Wait 10 Seconds) ##########################
Variable Per day Per second 11 secs
Access_denied_errors 7000
...
Questions 70000 3
Rocksdb_block_cache_add 3000
Rocksdb_block_cache_data_hit 10000
Rocksdb_block_cache_data_miss 1250
Rocksdb_block_cache_hit 17500
Rocksdb_block_cache_index_hit 8000
Rocksdb_block_cache_index_miss 1750
Rocksdb_block_cache_miss 3000
Rocksdb_bytes_read 125000 1
Rocksdb_bytes_written 7000
Rocksdb_compact_write_bytes 700000 7
Rocksdb_memtable_miss 5000
Rocksdb_no_file_opens 1750
Rocksdb_number_keys_re...

Read more...

My 10.2.5 is older than April 7:

openxs@ao756:~/dbs/maria10.2$ ls -l bin/mysqld
-rwxr-xr-x 1 openxs openxs 147978700 Apr 5 13:43 bin/mysqld

Elena Stepanova (elenst) wrote :

To reproduce, the 10.2 server should be started with InnoDB, and even then I imagine it can be system- or luck-dependent.

MariaDB 10.2 prints information about InnoDB system threads in the processlist. Like this:

+----+-------------+-----------------+------+---------+------+--------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------------+------+---------+------+--------------------------+------------------+----------+
| 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 2 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 1 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |
| 3 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 |
| 5 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 |
| 26 | root | localhost:40405 | test | Query | 0 | init | show processlist | 0.000 |
+----+-------------+-----------------+------+---------+------+--------------------------+------------------+----------+

Note NULL in the Time column.

The pt tool hangs (or rather falls into an endless loop) in summarize_processlist, where it uses the column to populate time and mtime.
For 'time', it seems to work okay, as it's populated by adding the value, and apparently it's implicitly converted into 0 on the way.
For 'mtime', under certain conditions it uses the value as is. Thus mtime becomes equal 'NULL' (the string).
Further, dark magic applies inside fuzzy_formula, which compares the value to various numbers and continues performing arithmetical operations until the result of the comparison is satisfactory. Among other things, it keeps increasing the 'factor' variable which is later used for more arithmetic.

The rest depends on the system/luck, apparently. 'factor' overflows rather quickly, if it becomes 0, the loop ends. If it becomes something ugly, the loop continues forever.

On the MariaDB side, printing the threads is a conscious decision (see https://jira.mariadb.org/browse/MDEV-11703). The question will be raised whether they should have NULL in the Time column, although coming up with a reasonable alternative might be challenging.

An easy fix for the pt tool is to convert a non-digital value to 0. Something like this, but maybe a bit smarter (as far as shell/awk scripting goes):

--- bin/pt-mysql-summary.orig 2017-04-19 00:29:48.992822955 +0300
+++ bin/pt-mysql-summary 2017-04-19 00:33:26.692815665 +0300
@@ -1334,6 +1334,9 @@
          }
          \$1 == \"Time:\" {
             t = \$2;
+ if ( t == \"NULL\" ) {
+ t = 0;
+ }
          }
          \$1 == \"Command:\" {
             c = \$2;

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-1422

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers