[MOS 9.0] Controller node redeployment fails with "Failed to call refresh" error

Bug #1607793 reported by Matija Draganovic
Affects              Status     Importance  Assigned to      Milestone
Fuel for OpenStack   Confirmed  Medium      Fuel Sustaining
  Mitaka             Confirmed  Medium      Fuel Sustaining

Bug Description

[MOS 9.0] Controller node redeployment fails with "Failed to call refresh" error

Environment:
1x Mgmt node (Fuel)
3x Controller node
3x Storage node
3x Compute node
MOS 9.0

Steps to reproduce the error:
- Shut down a controller node
- Remove it from the environment via the Fuel Web UI and run a deployment. Fuel removes the node and updates the active configuration
- Boot the server over PXE. Fuel notices a new node, bootstraps it, and marks it as a newly discovered system
- Add the node back to the environment (assign the controller role) and run the deployment again. Fuel runs the deployment but eventually fails with the following error(s):
2016-07-29 11:41:02 ERR (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) mysql -uclustercheck -pIKwpyRhWEyKGd7kHJKZGrcCo -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
2016-07-29 11:41:02 ERR (/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) Failed to call refresh: mysql -uclustercheck -pIKwpyRhWEyKGd7kHJKZGrcCo -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
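For reference, the failing wait-initial-sync exec is just a polling wrapper around Galera's sync state; the same check can be run by hand on the affected controller (using the clustercheck password generated for the environment, visible in the error above):

## The check wrapped by the wait-initial-sync exec, runnable manually:
mysql -uclustercheck -p<clustercheck-password> -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && echo Synced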

At this point MySQL on the newly added node appears to get stuck in a restart loop. The deployment fails, and simply re-running it does not help. The deployment can be finished successfully with the following steps:

## Connect to the new node (the one with the "error" status):
[root@fuel ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---+---------+------------------+---------+------------+-------------------+-------------------+---------------+--------+---------
 7 | stopped | Untitled (4e:f2) | 1 | 10.20.0.8 | ec:38:8f:6d:4e:f2 | ceph-osd | | 1 | 1
20 | stopped | Untitled (4e:e8) | 1 | 10.20.0.4 | ec:38:8f:6d:4e:e8 | controller, mongo | | 1 | 1
21 | error | Untitled (4e:ea) | 1 | 10.20.0.6 | ec:38:8f:6d:4e:ea | controller, mongo | | 1 | 1
 3 | stopped | Untitled (df:2f) | 1 | 10.20.0.5 | a4:dc:be:1c:df:2f | ceph-osd | | 1 | 1
 1 | stopped | Untitled (4b:88) | 1 | 10.20.0.3 | ec:38:8f:6d:4b:88 | compute | | 1 | 1
 9 | stopped | Untitled (4b:80) | 1 | 10.20.0.10 | ec:38:8f:6d:4b:80 | compute | | 1 | 1
18 | stopped | Untitled (4e:f6) | 1 | 10.20.0.7 | ec:38:8f:6d:4e:f6 | controller, mongo | | 1 | 1
 6 | stopped | Untitled (4e:f4) | 1 | 10.20.0.9 | ec:38:8f:6d:4e:f4 | ceph-osd | | 1 | 1
 8 | stopped | Untitled (4b:82) | 1 | 10.20.0.11 | ec:38:8f:6d:4b:82 | compute | | 1 | 1
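With many nodes in the environment, the failed one can also be picked straight out of the listing:

## Optional: show only nodes in "error" state:
[root@fuel ~]# fuel node | grep -w error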
[root@fuel ~]# ssh 10.20.0.6
Warning: Permanently added '10.20.0.6' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-92-generic x86_64)

 * Documentation: https://help.ubuntu.com/

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

## Verify that MySQL is not available:
root@node-21:~# mysql
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
root@node-21:~# ps -ef | grep -i mysql
root 17182 1 0 11:53 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --pid-file=/var/run/resource-agents/mysql-wss/mysql-wss.pid --socket=/var/run/mysqld/mysqld.sock --datadir=/var/lib/mysql --user=mysql --init-file=/tmp/wsrep-init-file
mysql 18281 17182 1 11:53 ? 00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --init-file=/tmp/wsrep-init-file --open-files-limit=102400 --pid-file=/var/run/resource-agents/mysql-wss/mysql-wss.pid --socket=/var/run/mysqld/mysqld.sock --port=3307 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
root 18282 17182 0 11:53 ? 00:00:00 logger -t mysqld -p daemon.error
mysql 18315 18281 0 11:53 ? 00:00:00 sh -c wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.7' --auth 'root:h8qxNjs9KyPRdsDrdS5hIJ7a' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '18281' ''
mysql 18316 18315 0 11:53 ? 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.0.7 --auth root:h8qxNjs9KyPRdsDrdS5hIJ7a --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --defaults-group-suffix --parent 18281
mysql 19130 18316 0 11:53 ? 00:00:00 perl /usr//bin/innobackupex --no-version-check --apply-log /var/lib/mysql//.sst
mysql 19195 19130 0 11:53 ? 00:00:00 sh -c xtrabackup_56 --defaults-file="/var/lib/mysql/.sst/backup-my.cnf" --defaults-group="mysqld" --prepare --target-dir=/var/lib/mysql/.sst --tmpdir=/tmp
mysql 19196 19195 7 11:53 ? 00:00:03 xtrabackup_56 --defaults-file=/var/lib/mysql/.sst/backup-my.cnf --defaults-group=mysqld --prepare --target-dir=/var/lib/mysql/.sst --tmpdir=/tmp
root 19575 16253 0 11:54 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/fuel/mysql-wss start
root 19645 19575 0 11:54 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/fuel/mysql-wss start
root 19646 19645 2 11:54 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --wsrep-recover --log-error=/dev/stdout
root 20514 19646 0 11:54 ? 00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-recover --log-error=/dev/stdout.err --open-files-limit=102400 --pid-file=/var/lib/mysql/node-21.domain.tld.pid --socket=/var/run/mysqld/mysqld.sock --port=3307 --wsrep_recover --log_error=/var/lib/mysql/wsrep_recovery.plzso2 --pid-file=/var/lib/mysql/node-21.domain.tld-recover.pid
root 20516 19473 0 11:54 pts/3 00:00:00 grep --color=auto -i mysql
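The process list shows an xtrabackup SST joiner still applying logs while the OCF agent is simultaneously attempting a wsrep recovery, consistent with the restart loop. Optionally, the Galera state file can be inspected as a sanity check; its seqno stays at -1 until the node has reached a consistent state:

## Optional: inspect the Galera state file (seqno remains -1 while the SST is incomplete):
root@node-21:~# cat /var/lib/mysql/grastate.dat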

## Stop pacemaker:
root@node-21:~# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate[ OK ]
Waiting for cluster services to unload.....[ OK ]
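Before killing anything, it may be worth confirming that the cluster stack is really down; crm_mon should refuse to connect once Pacemaker is stopped:

## Optional sanity check - crm_mon fails to connect while Pacemaker is down:
root@node-21:~# crm_mon -1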

## Kill leftover MySQL processes:
root@node-21:~# ps -ef | grep -i mysql
root 7852 1 0 12:05 ? 00:00:00 logger -t mysqld -p daemon.error
mysql 7888 1 0 12:05 ? 00:00:00 sh -c wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.7' --auth 'root:h8qxNjs9KyPRdsDrdS5hIJ7a' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7851' ''
mysql 7889 7888 0 12:05 ? 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.0.7 --auth root:h8qxNjs9KyPRdsDrdS5hIJ7a --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --defaults-group-suffix --parent 7851
mysql 8152 7889 0 12:05 ? 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 8153 7889 1 12:05 ? 00:00:00 xbstream -x
root 8813 19473 0 12:06 pts/3 00:00:00 grep --color=auto -i mysql
root@node-21:~# kill -9 7852 7888 7889 8152 8153
root@node-21:~# ps -ef | grep -i mysql
mysql 8872 1 0 12:06 ? 00:00:00 perl /usr//bin/innobackupex --no-version-check --apply-log /var/lib/mysql//.sst
mysql 8910 8872 0 12:06 ? 00:00:00 sh -c xtrabackup_56 --defaults-file="/var/lib/mysql/.sst/backup-my.cnf" --defaults-group="mysqld" --prepare --target-dir=/var/lib/mysql/.sst --tmpdir=/tmp
mysql 8911 8910 16 12:06 ? 00:00:01 xtrabackup_56 --defaults-file=/var/lib/mysql/.sst/backup-my.cnf --defaults-group=mysqld --prepare --target-dir=/var/lib/mysql/.sst --tmpdir=/tmp
root 9089 19473 0 12:06 pts/3 00:00:00 grep --color=auto -i mysql
root@node-21:~# kill -9 8872 8910 8911
root@node-21:~# ps -ef | grep -i mysql
root 9622 19473 0 12:07 pts/3 00:00:00 grep --color=auto -i mysql
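Instead of killing PIDs batch by batch, the leftovers can also be matched by name in a single pass (a sketch; it assumes no other MySQL- or SST-related processes are supposed to survive on this node):

## Alternative: kill all remaining MySQL/SST helper processes at once:
root@node-21:~# pkill -9 -f 'mysqld|wsrep_sst|innobackupex|xtrabackup|xbstream|socat'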

## Start MySQL manually and wait for it to finish the recovery and synchronize with the other nodes:
root@node-21:~# /etc/init.d/mysql start
mysql start/running, process 9840
root@node-21:~#
root@node-21:~#
root@node-21:~# tail -f /var/lib/mysql/innobackup.prepare.log /var/lib/mysql/innobackup.move.log
==> /var/lib/mysql/innobackup.prepare.log <==

[notice (again)]
  If you use binary log and don't use any hack of group commit,
  the binary log position seems to be:

xtrabackup: starting shutdown with innodb_fast_shutdown = 1
InnoDB: FTS optimize thread exiting.
InnoDB: Starting shutdown...
InnoDB: Shutdown completed; log sequence number 629429384
160729 12:09:01 innobackupex: completed OK!

==> /var/lib/mysql/innobackup.move.log <==
innobackupex: back to '/var/lib/mysql'

innobackupex: Starting to move InnoDB log files
innobackupex: in '/var/lib/mysql/.sst'
innobackupex: back to original InnoDB log directory '/var/lib/mysql'
innobackupex: Moving '/var/lib/mysql/.sst/ib_logfile0' to '/var/lib/mysql/ib_logfile0'
innobackupex: Moving '/var/lib/mysql/.sst/ib_logfile1' to '/var/lib/mysql/ib_logfile1'
innobackupex: Finished copying back files.

160729 12:09:01 innobackupex: completed OK!
^C
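Before handing control back to Pacemaker, the Galera state can optionally be checked directly; once the SST has completed, the node should report Synced and see all three controllers:

## Optional: confirm the node has joined and synced with the cluster:
root@node-21:~# mysql -Nbe "SHOW STATUS LIKE 'wsrep_local_state_comment'"
root@node-21:~# mysql -Nbe "SHOW STATUS LIKE 'wsrep_cluster_size'"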

## Verify that the service is up & running:
root@node-21:~# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 22
Server version: 5.6.23-1~u14.04+mos1 (Ubuntu), wsrep_25.10

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| aodh |
| cinder |
| glance |
| heat |
| keystone |
| murano |
| mysql |
| neutron |
| nova |
| nova_api |
| performance_schema |
| test |
+--------------------+
13 rows in set (0.01 sec)

mysql> exit
Bye

## Stop MySQL, restart pacemaker and wait for it to bring the MySQL resource up automatically:
root@node-21:~# /etc/init.d/mysql stop
mysql stop/waiting
root@node-21:~# service pacemaker start
Starting Pacemaker Cluster Manager[ OK ]
root@node-21:~# ps -ef | grep -i mysql
root 12810 12033 0 12:10 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/fuel/mysql-wss start
root 13026 12810 0 12:10 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/fuel/mysql-wss start
root 13027 13026 0 12:10 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --wsrep-recover --log-error=/dev/stdout
mysql 13889 13027 56 12:10 ? 00:00:01 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-recover --log-error=/dev/stdout.err --open-files-limit=102400 --pid-file=/var/lib/mysql/node-21.domain.tld.pid --socket=/var/run/mysqld/mysqld.sock --port=3307 --wsrep_recover --log_error=/var/lib/mysql/wsrep_recovery.U9pJs7 --pid-file=/var/lib/mysql/node-21.domain.tld-recover.pid
nobody 13934 13933 0 12:10 ? 00:00:00 mysql -nNE --connect-timeout=10 --user=clustercheck --password=x xxxxxxxxxxxxxxxxxxxxxx -h 192.168.0.7 -P 3307 -e SHOW STATUS LIKE 'wsrep_local_state';
nobody 13938 13937 0 12:10 ? 00:00:00 mysql -nNE --connect-timeout=10 --user=clustercheck --password=x xxxxxxxxxxxxxxxxxxxxxx -h 192.168.0.7 -P 3307 -e SHOW STATUS LIKE 'wsrep_local_state';
nobody 13942 13941 0 12:10 ? 00:00:00 mysql -nNE --connect-timeout=10 --user=clustercheck --password=x xxxxxxxxxxxxxxxxxxxxxx -h 192.168.0.7 -P 3307 -e SHOW STATUS LIKE 'wsrep_local_state';
root 13945 19473 0 12:10 pts/3 00:00:00 grep --color=auto -i mysql
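Pacemaker may take a minute or two to bring the resource up; progress can be followed with crm_mon (in Fuel/MOS deployments the Galera resource is typically named clone_p_mysqld, but verify the exact name in the output):

## Optional: watch Pacemaker bring the MySQL resource back up:
root@node-21:~# crm_mon -1 | grep -i mysql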

## Verify that MySQL is available:
root@node-21:~# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 27
Server version: 5.6.23-1~u14.04+mos1 (Ubuntu), wsrep_25.10

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| aodh |
| cinder |
| glance |
| heat |
| keystone |
| murano |
| mysql |
| neutron |
| nova |
| nova_api |
| performance_schema |
| test |
+--------------------+
13 rows in set (0.00 sec)

mysql> exit
Bye

## Return to the Fuel Web UI and run the deployment again. This time it should finish successfully.
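The deployment can also be triggered from the Fuel master CLI instead of the Web UI (environment id 1, as in the fuel node listing above):

## Alternative: re-run the deployment from the Fuel CLI:
[root@fuel ~]# fuel deploy-changes --env 1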

The issue has been reproduced several times on distinct controller nodes (in fact, on all three controllers in this setup). It was observed only during node redeployment (remove the node, refresh the environment, add the node back); deploying a fresh environment did not expose the issue (also tested several times).

The following logs are attached to the case:
- Fuel Web UI puppet log
- Node's puppet log
- MySQL error log
- Pacemaker log

Changed in fuel:
milestone: none → 9.1
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → Medium
status: New → Confirmed
tags: added: customer-found
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 9.1 → 10.0
tags: added: release-notes
tags: added: release-notes-done
removed: release-notes