CI ovb-ha job is broken on Galera setup with Puppet4

Bug #1645787 reported by Emilien Macchi
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Emilien Macchi
Milestone: (not set)

Bug Description

Error: Failed to apply catalog: Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory")

This happens when deploying the HA scenario with MySQL Galera.
It might be an orchestration/ordering issue.

tags: added: alert
Revision history for this message
Damien Ciabrini (dciabrin) wrote :

From this failed CI job [1], it looks like the galera cluster has not been started at all. I don't see _any_ logs from the galera resource agent on any of the controllers. This probably means that the galera resource hasn't been created as expected, or that something prevented pacemaker from enabling the galera resource (i.e. starting the galera cluster).

For the record, when the galera resource is created under pacemaker, we should see logs like "galera(galera).*INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'" on all controllers in /var/log/messages.
/var/log/mariadb/mariadb.log would reflect that action with a log like "Running position recovery with.*".

Subsequent DB logs always go to /var/log/mysqld.log (mariadb.log is only used for bootstrapping the cluster).
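
A minimal sketch of how one might check local copies of those logs for the two patterns quoted above. The function names are hypothetical, and the file paths passed in are assumed to be downloaded copies of each controller's logs:

```shell
# Hypothetical helper: succeeds if the galera resource agent logged its
# recovery step in the given copy of /var/log/messages.
galera_agent_started() {
    # $1: path to a local copy of a controller's messages log
    grep -Eq "galera\(galera\).*INFO: now attempting to detect last commit version" "$1"
}

# Hypothetical helper: succeeds if the bootstrap log (mariadb.log) shows
# that position recovery was actually run.
galera_recovery_ran() {
    # $1: path to a local copy of /var/log/mariadb/mariadb.log
    grep -q "Running position recovery with" "$1"
}

# Example use (paths are placeholders):
#   galera_agent_started ./controller-0/messages || echo "no galera agent logs"
#   galera_recovery_ran ./controller-0/mariadb.log || echo "no recovery logged"
```

If neither check succeeds on any controller, that is consistent with the galera resource never having been created or enabled.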

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Emilien saw that the /etc/sysconfig/clustercheck file is not created on controller-0, while it seems to be created on the other nodes, controller-1 and controller-2.

wget http://logs.openstack.org/23/404223/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8bfb546/logs/overcloud-controller-1/var/log/messages -O- 2>/dev/null | grep -F 'clustercheck' | tail -1
Nov 29 15:34:01 localhost os-collect-config: #033[mNotice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[create-root-sysconfig-clustercheck]/returns: executed successfully#033[0m

So it seems something interrupted the deploy. Could a loss of quorum be getting in the way?

# wget http://logs.openstack.org/23/404223/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8bfb546/logs/overcloud-controller-1/var/log/messages -O- 2>/dev/null | grep QUORUM
Nov 29 15:32:00 localhost corosync[26839]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 29 15:32:00 localhost corosync[26839]: [QUORUM] Members[1]: 2
Nov 29 15:32:23 localhost corosync[26839]: [QUORUM] This node is within the primary component and will provide service.
Nov 29 15:32:23 localhost corosync[26839]: [QUORUM] Members[3]: 2 3 1
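
To make transitions like these easier to spot in a long log, the grep output can be condensed to one line per quorum change. This is only a sketch: the function name is hypothetical, and it assumes the syslog timestamp occupies the first three fields, as in the lines above.

```shell
# Hypothetical helper: read a messages log on stdin and print one line per
# corosync quorum transition (timestamp plus "quorum lost"/"quorum regained").
quorum_transitions() {
    awk '
        /\[QUORUM\]/ && /within the non-primary component/ { print $1, $2, $3, "quorum lost" }
        /\[QUORUM\]/ && /within the primary component/     { print $1, $2, $3, "quorum regained" }
    '
}

# Example use, reusing the wget pipeline from this comment:
#   wget <messages URL> -O- 2>/dev/null | quorum_transitions
```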

Although it seems quorum was recovered, this is never a good sign and should not happen in the first place. A few seconds after this quorum log, I see the first suspicious log related to connections:

Nov 29 15:32:32 localhost os-collect-config: sure: created\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/content: content changed '{md5}1f337186b0e1ba5ee82760cb437fb810' to '{md5}fe996ee969512d62439f8ed86b2490fc'\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'\u001b[0m\n\u001b[mNotice: /Stage[main]/Tripleo::Profile::Base::Haproxy/Exec[haproxy-reload]: Triggered 'refresh' from 1 events\u001b[0m\n\u001b[mNotice: Applied catalog in 115.75 seconds\u001b[0m\n", "deploy_stderr": "exception: connect failed\n\u001b[1;33mWarning: This method is deprecated, please use match expressions with Stdlib::Compat::Array instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions.\n (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:19:in `deprecation')\u001b[0m\n\u001b[1;33mWarning: This method is deprecated, please use the stdlib validate_legacy function, with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README.\n (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:19:in `deprecation')\u001b[0m\n\u001b[1;33mWarning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\n (file & line not available)\u001b[0m\n\u001b[1;33mWarning: ModuleLoader: module 'rabbitmq' has unresolved dependencies - it will only see those that are resolved. U...


Revision history for this message
Emilien Macchi (emilienm) wrote :

Just FYI Damien, in your last comment you're looking at the wrong logs.
Currently, create-root-sysconfig-clustercheck is not executed, and I suspect the mysqld process is not started.

Revision history for this message
Emilien Macchi (emilienm) wrote :

What I meant by wrong logs is that http://logs.openstack.org/23/404223/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/8bfb546/logs/overcloud-controller-1/var/log/messages is not a log from a Puppet 4 run but from a Puppet 3 run, which worked fine.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/404437

tags: removed: alert
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/404437
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=f101ee540a450d3509e1c97aab00a97a2ecf8164
Submitter: Jenkins
Branch: master

commit f101ee540a450d3509e1c97aab00a97a2ecf8164
Author: Emilien Macchi <email address hidden>
Date: Tue Nov 29 16:57:20 2016 -0500

    pacemaker: create Mysql_user once Galera is ready (puppet4)

    Puppet 4 ordering make things more strict in catalog, which is good.
    Resources have to be orchestrated or Puppet will take them in the order
    they are found in catalog.

    This patch makes sure we create MySQL users only when Galera is actually
    ready.

    Closes-Bug: #1645787
    Change-Id: I536a1a128c3a7eca49bcc4f34a1307bcd60b029e
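
The actual fix is a Puppet ordering change in puppet-tripleo (see the review linked above). Purely as an illustration of the same idea, gating the MySQL user query on the database being ready, here is a shell sketch. The function name and timeout are hypothetical; only the socket path and the query come from the error in the bug description.

```shell
# Hypothetical helper: wait until a UNIX socket exists, or give up.
# $1: socket path, $2: timeout in seconds (defaults to 60)
wait_for_socket() {
    sock=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -S "$sock" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1   # socket never appeared: the database is not ready
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example use: only list MySQL users once the socket is present.
#   wait_for_socket /var/lib/mysql/mysql.sock 120 &&
#       /usr/bin/mysql -NBe "SELECT CONCAT(User, '@', Host) AS User FROM mysql.user"
```

Without such a gate, the query runs whenever it happens to be scheduled and fails with ERROR 2002 if Galera has not started yet, which matches the failure reported here.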

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 6.1.0

This issue was fixed in the openstack/puppet-tripleo 6.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/529114

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/newton)

Reviewed: https://review.openstack.org/529114
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2576acbb897fd077e336a5bc472a6015ee04d328
Submitter: Zuul
Branch: stable/newton

commit 2576acbb897fd077e336a5bc472a6015ee04d328
Author: Emilien Macchi <email address hidden>
Date: Tue Nov 29 16:57:20 2016 -0500

    pacemaker: create Mysql_user once Galera is ready (puppet4)

    Puppet 4 ordering make things more strict in catalog, which is good.
    Resources have to be orchestrated or Puppet will take them in the order
    they are found in catalog.

    This patch makes sure we create MySQL users only when Galera is actually
    ready.

    Closes-Bug: #1645787
    Change-Id: I536a1a128c3a7eca49bcc4f34a1307bcd60b029e
    (cherry picked from commit f101ee540a450d3509e1c97aab00a97a2ecf8164)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 5.6.6

This issue was fixed in the openstack/puppet-tripleo 5.6.6 release.
