Failed HA Controller deployment, mysql fails to start

Bug #1643670 reported by Dan Trainor on 2016-11-21
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Julie Pichon

Bug Description

3-controller 1-node deployments fail. Looking into this further, it appears that mysql never starts properly on at least one node:

[stack@rdo-ci-fx2-04-s8 ~]$ openstack stack failures list RHELOSP-18748 --long
RHELOSP-18748.AllNodesDeploySteps.ControllerDeployment_Step2.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 3101ba5c-a361-4377-9b4d-05e01fb77a9b
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    Matching apachectl 'Server version: Apache/2.4.6 (Red Hat Enterprise Linux)
    Server built: Aug 3 2016 08:33:27'
    Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.
    Notice: Compiled catalog for rhelosp-18748-controller-0.localdomain in environment production in 13.57 seconds
<snip>
    Notice: /Stage[main]/Nova::Db::Mysql/Openstacklib::Db::Mysql[nova]/Mysql_database[nova]: Dependency Exec[galera-ready] has failures: true

Logging in to the controllers shows that mysql failed to start on at least two Controller nodes.

/var/log/mysql.log and /var/log/mariadb/mariadb.log attached from all three nodes, per bandini's request.

Damien Ciabrini (dciabrin) wrote :

Looking at the logs, controller-1 bootstrapped the cluster; the other two nodes tried to join afterwards and requested an SST (rsync) to sync their local state.

The logs show that the SST requests always fail with "connection refused" errors. For instance, when controller-0 requests an SST from controller-1, it starts an rsyncd and waits for controller-1 to transfer the data via rsync.

161121 18:03:58 [Note] WSREP: Member 0.0 (rhelosp-18748-controller-0.localdomain) requested state transfer from '*any*'. Selected 1.0 (rhelosp-18748-controller-1.localdomain)(SYNCED) as donor.
161121 18:03:58 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 0)
161121 18:03:58 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
161121 18:03:58 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address 'rhelosp-18748-controller-0.internalapi.localdomain:4444/rsync_sst' --auth '(null)' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'e819dd00-b013-11e6-93a9-c3de2248d2b2:0''

What's odd is that controller-1 seems to resolve the DNS name of controller-0 as "localhost" (rsync tried both IPv6 and IPv4):

rsync: failed to connect to rhelosp-18748-controller-0.internalapi.localdomain (::1%1): Connection refused (111)
rsync: failed to connect to rhelosp-18748-controller-0.internalapi.localdomain (127.0.0.1): Connection refused (111)
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]

Consequently the rsync transfer fails, and controller-0 and controller-2 can never join the galera cluster.

Looking at /etc/hosts I see:
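To spot this symptom quickly in a donor log, the rsync error lines quoted above can be filtered for loopback targets. This is only a minimal sketch: the regular expression is derived from the two error lines in this report, not from rsync documentation, and the function name `loopback_failures` is made up for illustration.

```python
import ipaddress
import re

# Pattern modelled on the rsync client errors quoted in this report
# (an assumption: other rsync versions may word the message differently).
RSYNC_FAIL = re.compile(
    r"rsync: failed to connect to (?P<host>\S+) "
    r"\((?P<addr>[^)]+)\): (?P<reason>.+) \((?P<errno>\d+)\)"
)


def loopback_failures(log_text):
    """Return (host, address) pairs where rsync dialed a loopback address."""
    hits = []
    for line in log_text.splitlines():
        match = RSYNC_FAIL.search(line)
        if not match:
            continue
        # Drop an IPv6 zone suffix such as "%1" before parsing the address.
        bare_addr = match.group("addr").split("%", 1)[0]
        try:
            ip = ipaddress.ip_address(bare_addr)
        except ValueError:
            continue  # not a literal IP address, skip it
        if ip.is_loopback:
            hits.append((match.group("host"), match.group("addr")))
    return hits
```

Any hit for a remote node's hostname suggests that the name resolves locally, which is what happened here.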

172.17.0.21 rhelosp-18748-controller-0.internalapi. rhelosp-18748-controller-0.internalapi

Could it be that the "." at the end of all HEAT-generated entries is what causes the issue?
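A quick way to check the other nodes for the same pattern is to scan the hosts file for names ending in a dot. The following is a rough sketch; the helper name `trailing_dot_names` is hypothetical, and the sample line is the entry quoted above.

```python
def trailing_dot_names(hosts_text):
    """Return (address, name) pairs for hosts-file names that end in '.'."""
    found = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        found.extend((address, name) for name in names if name.endswith("."))
    return found


sample = (
    "172.17.0.21 rhelosp-18748-controller-0.internalapi. "
    "rhelosp-18748-controller-0.internalapi\n"
)
print(trailing_dot_names(sample))
# [('172.17.0.21', 'rhelosp-18748-controller-0.internalapi.')]
```

On a node, the same function could be fed the contents of /etc/hosts instead of the sample string.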

Julie Pichon (jpichon) wrote :

Damien, thanks for pointing toward the dot. I noticed that too but since it was somehow still pingable, forgot about it.

I finally went past the galera/mariadb issues in my environment. Here's what I think happened:

1. For one reason or another, we edited the parameters for one of the roles (e.g. for me, that was to work around bug 1642342 and set the different role flavors)
2. CloudDomain got updated to "" in the mistral environment even though we had not changed it
3. Failure on galera/mariadb for every deployment, in both Newton and Ocata

Looking at the help string for CloudDomain in the UI, it reads: "The DNS domain used for the hosts. This should match the dhcp_domain configured in the Undercloud neutron. Defaults to localdomain." but the UI does not set the default.

I set that parameter to localdomain, restarted my deployment and it finally worked.

I'm not sure where the fix needs to go for the UI to pick it up. Are we not extracting the default correctly, or are the templates not defining it in the expected way?

tags: added: newton-backport-potential ui
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → ocata-2
Jiri Tomasek (jtomasek) wrote :

The CloudDomain parameter has its default correctly set only in overcloud.j2.yaml, not in role templates such as puppet/cephstorage-role.yaml. The default value is not set there, and therefore the GUI sets the value to an empty string.
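The mechanism can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the dictionaries stand in for parsed template parameter definitions, `ui_parameter_values` imitates a client that submits a value for every parameter, and `fqdn` assumes the host name is built as host.network.domain, which is consistent with the names seen in this report but is not taken from the templates themselves. The sketch treats a missing default and an empty one the same way.

```python
def ui_parameter_values(param_defs):
    """Imitate a client that submits every parameter, using '' without a default."""
    return {name: spec.get("default") or "" for name, spec in param_defs.items()}


def fqdn(short_name, network, cloud_domain):
    # Assumed naming scheme: <host>.<network>.<domain>
    return f"{short_name}.{network}.{cloud_domain}"


# Hypothetical parsed definitions: top-level template vs. a role template.
top_level = {"CloudDomain": {"type": "string", "default": "localdomain"}}
role_template = {"CloudDomain": {"type": "string"}}  # no usable default

good = ui_parameter_values(top_level)["CloudDomain"]
bad = ui_parameter_values(role_template)["CloudDomain"]

print(fqdn("rhelosp-18748-controller-0", "internalapi", good))
# rhelosp-18748-controller-0.internalapi.localdomain
print(fqdn("rhelosp-18748-controller-0", "internalapi", bad))
# rhelosp-18748-controller-0.internalapi.
```

The second name matches the trailing-dot entry found in /etc/hosts earlier in this report.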

Related bug: https://bugs.launchpad.net/tripleo/+bug/1640243

tags: added: tripleo-heat-templates
Jiri Tomasek (jtomasek) wrote :

In addition, if a value constraint were properly defined in the parameter definition, we'd be able to avoid such bugs.
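As a rough illustration of what such a constraint could enforce, here is a standalone validator. It is not Heat's implementation; the pattern and the function name `validate_cloud_domain` are assumptions, chosen to reject an empty value and a leading or trailing dot.

```python
import re

# Hypothetical pattern: one or more dot-separated DNS labels,
# so the empty string and names like "localdomain." are rejected.
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
DOMAIN_PATTERN = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def validate_cloud_domain(value):
    """Return value unchanged if it looks like a DNS domain, else raise ValueError."""
    if not DOMAIN_PATTERN.fullmatch(value):
        raise ValueError(f"invalid CloudDomain: {value!r}")
    return value


print(validate_cloud_domain("localdomain"))  # localdomain
```

With a check like this in the parameter definition, the empty string submitted by the UI would have been rejected at validation time instead of surfacing later as a galera failure.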

Julie Pichon (jpichon) wrote :

It looks like we're hitting another version of https://review.openstack.org/#/c/354069/ . I will propose a patch to reinstate the defaults and make them match the docstring which is also copy-pasted in each file. Hopefully that will be acceptable.

Changed in tripleo:
assignee: nobody → Julie Pichon (jpichon)
Julie Pichon (jpichon) wrote :

Also, once again thank you for the invaluable pointers Jirka.

Fix proposed to branch: master
Review: https://review.openstack.org/400966

Changed in tripleo:
status: Triaged → In Progress
Julie Pichon (jpichon) wrote :

I only noticed now that the comments in https://review.openstack.org/#/c/354069/ mention the approach I took in the patch. It seems like everything should work the same with it, the only unfortunate part being the duplication.

Reviewed: https://review.openstack.org/400966
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=0ca8dab4cb2acb2eb3904e4edfbbd33a47fa97a3
Submitter: Jenkins
Branch: master

commit 0ca8dab4cb2acb2eb3904e4edfbbd33a47fa97a3
Author: Julie Pichon <email address hidden>
Date: Tue Nov 22 20:39:33 2016 +0000

    Make the CloudDomain defaults match the doc strings

    Not having the default easily accessible is causing issues for the UI,
    as it cannot guess at it and can accidentally overwrite the value with
    an empty string (the expected default when unset). The default is
    already helpfully spelled out in the doc string for each file, this
    updates the parameter to match it.

    Change-Id: Ic284f9904e8f1d01cc717d59a0759f679d94106d
    Closes-Bug: #1643670

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/401359
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c739dcda9f387196af721cf1e675867c3da97d02
Submitter: Jenkins
Branch: stable/newton

commit c739dcda9f387196af721cf1e675867c3da97d02
Author: Julie Pichon <email address hidden>
Date: Tue Nov 22 20:39:33 2016 +0000

    Make the CloudDomain defaults match the doc strings

    Not having the default easily accessible is causing issues for the UI,
    as it cannot guess at it and can accidentally overwrite the value with
    an empty string (the expected default when unset). The default is
    already helpfully spelled out in the doc string for each file, this
    updates the parameter to match it.

    Change-Id: Ic284f9904e8f1d01cc717d59a0759f679d94106d
    Closes-Bug: #1643670
    (cherry picked from commit 0ca8dab4cb2acb2eb3904e4edfbbd33a47fa97a3)

tags: added: in-stable-newton

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0b2 development milestone.

This issue was fixed in the openstack/tripleo-heat-templates 5.2.0 release.
