[library] while Galera node is in Sync or Donor state many services are down

Bug #1293680 reported by Brad Durrow
This bug affects 4 people
Affects:     Fuel for OpenStack
Status:      Fix Released
Importance:  Medium
Assigned to: Sergii Golovatiuk
Milestone:

Bug Description

Fuel 4.0

While a controller is in the following state:

[root@node-17 ~]# mysql -e "show status like 'wsrep_local_state_comment';"
+---------------------------+-----------------------------------+
| Variable_name             | Value                             |
+---------------------------+-----------------------------------+
| wsrep_local_state_comment | Joining: receiving State Transfer |
+---------------------------+-----------------------------------+

I find that many services hang or fail. To put it another way, the HA features do not seem to provide high availability.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Please provide a more comprehensive and complete bug description. Which services are down? On which nodes? On which node are you issuing this mysql command?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Brad Durrow (l-brad) wrote :

Precondition:
HA Cluster Fuel 4.0
All controllers and services are up as reported by crm status
One controller recently rebooted or had mysql restarted

When I use the command "nova service-list" (as an example) to get a list of OpenStack nodes and roles, I expect output that contains a list of OpenStack services. Instead I get a timeout.

For reference:
node-16=10.0.5.2
node-17=10.0.5.3
node-18=10.0.5.4
vip__management_old=10.0.5.10

[root@node-18 ~]# nova service-list
ERROR: HTTPConnectionPool(host='10.0.5.10', port=5000): Max retries exceeded with url: /v2.0/tokens
[root@node-18 ~]# crm status
Last updated: Tue Mar 18 06:50:48 2014
Last change: Tue Mar 18 06:39:53 2014 via crmd on node-16.domain.com
Stack: openais
Current DC: node-16.domain.com - partition with quorum
Version: 1.1.8-1.el6-1f8858c
3 Nodes configured, 3 expected votes
19 Resources configured.

Online: [ node-16.domain.com node-17.domain.com node-18.domain.com ]

 vip__management_old (ocf::heartbeat:IPaddr2): Started node-16.domain.com
 vip__public_old (ocf::heartbeat:IPaddr2): Started node-17.domain.com
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-18.domain.com
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-16.domain.com
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-17.domain.com
 p_openstack-ceilometer-central (ocf::mirantis:ceilometer-agent-central): Started node-18.domain.com
 p_openstack-ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Started node-16.domain.com

Failed actions:
    p_mysql_monitor_60000 (node=node-18.domain.com, call=95, rc=7, status=complete): not running
[root@node-18 ~]# crm resource restart clone_p_haproxy
INFO: ordering clone_p_haproxy to stop
INFO: ordering clone_p_haproxy to start
[root@node-18 ~]# mysql -e "show status like 'wsrep%';"
+----------------------------+----------------------------------------------+
| Variable_name              | Value                                        |
+----------------------------+----------------------------------------------+
| wsrep_local_state_uuid     |                                              |
| wsrep_protocol_version     | 4                                            |
| wsrep_last_committed       | 18446744073709551615                         |
| wsrep_replicated           | 0                                            |
| wsrep_replicated_bytes     | 0                                            |
| wsrep_received             | 1                                            |
| wsrep_received_bytes       | 274                                          |
| wsrep_local_commits        | 0 ...


Revision history for this message
Andrew Woodward (xarses) wrote :

We need to verify that haproxy selects the right node when the master is in the Joining (receiving state transfer) state or acting as a donor.

@Brad, if you enter this state again, see if mysql is available when you connect to the management-vip:3306
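For example (just a sketch: 10.0.5.10 is the management VIP from the bug description, and <user> is a placeholder for whichever MySQL account your deployment allows through the VIP):

  # connect through the VIP rather than the local socket
  mysql -h 10.0.5.10 -P 3306 -u <user> -p -e "show status like 'wsrep_local_state_comment';"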

Changed in fuel:
status: Incomplete → New
tags: added: customer-found
description: updated
Changed in fuel:
status: New → Triaged
importance: Undecided → High
milestone: none → 4.1.1
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As a solution we could also track joining and donor statuses via extended CIB attributes for nodes in the pacemaker cluster.
That would allow us to define the management (and public, if needed) VIP colocation score as "-inf" for both donor and joining nodes, automatically ensuring the VIP lands on the third running node, which is "free for tasks"...
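A rough sketch of that idea with pacemaker tooling (the attribute name galera-role and the constraint id are hypothetical, not what was eventually implemented):

  # an OCF agent or hook could publish the Galera role as a transient node attribute
  crm_attribute --node node-17.domain.com --lifetime reboot --name galera-role --update donor

  # a location rule could then keep the management VIP away from donor/joining nodes
  crm configure location loc_vip_avoid_donor vip__management_old \
    rule -inf: galera-role eq donor or galera-role eq joining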

Revision history for this message
Brad Durrow (l-brad) wrote :

Bogdan, for my Fuel 4.0 deployment it looks like services use the management VIP to connect to MySQL, but :3306 has haproxy listening behind it. I believe it would be more desirable to make MySQL connections result in a TCP RST so that haproxy can fail quickly.

According to the haproxy documentation I found, the option mysql-check works like this:

  ...the check consists of sending two MySQL packet,
  one Client Authentication packet, and one QUIT packet, to correctly close
  MySQL session. We then parse the MySQL Handshake Initialisation packet and/or
  Error packet. It is a basic but useful test which does not produce error nor
  aborted connect on the server.

For reference, my haproxy is configured like this.

I changed the config so that my management IPs are 10.0.5.x, where x is:
.10 management vip
.2 node this config came from
.3 other node
.4 other node

listen mysqld
  bind 10.0.5.10:3306
  balance roundrobin
  mode tcp
  option mysql-check user cluster_watcher
  option tcplog
  option clitcpka
  option srvtcpka
  timeout client 28801s
  timeout server 28801s
  server node-16 10.0.5.2:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3
  server node-17 10.0.5.3:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3 backup
  server node-18 10.0.5.4:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3 backup

tags: added: backports-4.1.1
Changed in fuel:
milestone: 4.1.1 → 5.0
tags: removed: backports-4.1.1
Changed in fuel:
importance: High → Medium
Andrew Woodward (xarses)
tags: added: ha
Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
tags: added: to-be-covered-by-tests
Dmitry Ilyin (idv1985)
summary: - while Galera node is in Sync or Donor state many services are down
+ [puppet] while Galera node is in Sync or Donor state many services are down
Dmitry Ilyin (idv1985)
summary: - [puppet] while Galera node is in Sync or Donor state many services are down
+ [library] while Galera node is in Sync or Donor state many services are down
Changed in fuel:
status: Triaged → Fix Committed
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please comment on why you believe this bug has been fixed in 5.1.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

It was implemented in https://review.openstack.org/#/c/106516/ (blueprint: galera-improvements).
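For reference, a common pattern for making haproxy Galera-state-aware (not necessarily the exact change in the review above) is to replace mysql-check with an HTTP check against a clustercheck-style script that returns 503 while the node is a donor or joiner, so haproxy takes it out of rotation. A minimal sketch, reusing the addresses from the config above; the check agent and its port 9200 are assumptions:

  listen mysqld
    bind 10.0.5.10:3306
    mode tcp
    # assumed: a clustercheck-style agent on each node listens on 9200 and
    # answers HTTP 200 only when wsrep_local_state is Synced
    option httpchk
    server node-16 10.0.5.2:3307 check port 9200 inter 15s rise 5 fall 3
    server node-17 10.0.5.3:3307 check port 9200 inter 15s rise 5 fall 3 backup
    server node-18 10.0.5.4:3307 check port 9200 inter 15s rise 5 fall 3 backup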

tags: added: in progress
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #11

"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "auth_required": true, "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}}}, "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"

I deployed CentOS, HA with 3 controllers.
After that I restarted the second controller and observed the following on the first controller:

[root@node-1 ~]# mysql -e "show status like 'wsrep_local_state_comment';"
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_state_comment | Donor/Desynced |
+---------------------------+----------------+

Immediately:

[root@node-1 ~]# nova service-list
+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
| Binary           | Host              | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
| nova-consoleauth | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-scheduler   | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-conductor   | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:40.000000 | -               |
| nova-cert        | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-consoleauth | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:38.000000 | -               |
| nova-scheduler   | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:38.000000 | -               |
| nova-conductor   | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:39.000000 | -               |
| nova-consoleauth | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| nova-scheduler   | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| nova-conductor   | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:10.000000 | -               |
| nova-cert        | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:41.000000 | -               |
| nova-cert        | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| n...


Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: in progress
tags: added: in progress