[library] while Galera node is in Sync or Donor state many services are down

Bug #1293680 reported by Brad Durrow
This bug affects 4 people
Affects:     Fuel for OpenStack
Status:      Fix Released
Importance:  Medium
Assigned to: Sergii Golovatiuk
Milestone:

Bug Description

Fuel 4.0

While a controller is in the following state:

[root@node-17 ~]# mysql -e "show status like 'wsrep_local_state_comment';"
+---------------------------+-----------------------------------+
| Variable_name             | Value                             |
+---------------------------+-----------------------------------+
| wsrep_local_state_comment | Joining: receiving State Transfer |
+---------------------------+-----------------------------------+

I find that many services hang or fail. To put it another way, the HA features do not seem to provide high availability.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Please provide a more comprehensive and complete bug description. Which services are down? On which nodes? On which node are you issuing this mysql command?

Changed in fuel:
status: New → Incomplete
Revision history for this message
Brad Durrow (l-brad) wrote :

Precondition:
HA Cluster Fuel 4.0
All controllers and services are up as reported by crm status
One controller recently rebooted or had mysql restarted

When I use the command "nova service-list" (as an example) to get a list of OpenStack nodes and roles, I expect output that contains a list of OpenStack services. Instead I get a timeout.

For reference:
node-16=10.0.5.2
node-17=10.0.5.3
node-18=10.0.5.4
vip__management_old=10.0.5.10

[root@node-18 ~]# nova service-list
ERROR: HTTPConnectionPool(host='10.0.5.10', port=5000): Max retries exceeded with url: /v2.0/tokens
[root@node-18 ~]# crm status
Last updated: Tue Mar 18 06:50:48 2014
Last change: Tue Mar 18 06:39:53 2014 via crmd on node-16.domain.com
Stack: openais
Current DC: node-16.domain.com - partition with quorum
Version: 1.1.8-1.el6-1f8858c
3 Nodes configured, 3 expected votes
19 Resources configured.

Online: [ node-16.domain.com node-17.domain.com node-18.domain.com ]

 vip__management_old (ocf::heartbeat:IPaddr2): Started node-16.domain.com
 vip__public_old (ocf::heartbeat:IPaddr2): Started node-17.domain.com
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_neutron-openvswitch-agent [p_neutron-openvswitch-agent]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-16.domain.com node-17.domain.com node-18.domain.com ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started node-18.domain.com
 p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-16.domain.com
 openstack-heat-engine (ocf::mirantis:openstack-heat-engine): Started node-17.domain.com
 p_openstack-ceilometer-central (ocf::mirantis:ceilometer-agent-central): Started node-18.domain.com
 p_openstack-ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Started node-16.domain.com

Failed actions:
    p_mysql_monitor_60000 (node=node-18.domain.com, call=95, rc=7, status=complete): not running
[root@node-18 ~]# crm resource restart clone_p_haproxy
INFO: ordering clone_p_haproxy to stop
INFO: ordering clone_p_haproxy to start
[root@node-18 ~]# mysql -e "show status like 'wsrep%';"
+----------------------------+----------------------------------------------+
| Variable_name              | Value                                        |
+----------------------------+----------------------------------------------+
| wsrep_local_state_uuid     |                                              |
| wsrep_protocol_version     | 4                                            |
| wsrep_last_committed       | 18446744073709551615                         |
| wsrep_replicated           | 0                                            |
| wsrep_replicated_bytes     | 0                                            |
| wsrep_received             | 1                                            |
| wsrep_received_bytes       | 274                                          |
| wsrep_local_commits        | 0 ...


Revision history for this message
Andrew Woodward (xarses) wrote :

We need to verify that haproxy selects the right node when the master is in the Joining (receiving state transfer) state or acting as a donor.

@Brad, if you enter this state again, see if mysql is available when you connect to the management-vip:3306
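For example (just a sketch: 10.0.5.10 is the management VIP from the bug description, and <user> is a placeholder for whichever MySQL account your deployment allows through the VIP):

  # connect through the VIP rather than the local socket
  mysql -h 10.0.5.10 -P 3306 -u <user> -p -e "show status like 'wsrep_local_state_comment';"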

Changed in fuel:
status: Incomplete → New
tags: added: customer-found
description: updated
Changed in fuel:
status: New → Triaged
importance: Undecided → High
milestone: none → 4.1.1
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As a solution we could also track joining and donor statuses via extended CIB attributes for nodes in the pacemaker cluster.
That would allow us to define the management (and public, if needed) VIP colocation score as "-inf" for both donor and joining nodes, automatically ensuring the VIP lands on the third running node, which is "free for tasks"...
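A rough sketch of that idea with pacemaker tooling (the attribute name galera-role and the constraint id are hypothetical, not what was eventually implemented):

  # an OCF agent or hook could publish the Galera role as a transient node attribute
  crm_attribute --node node-17.domain.com --lifetime reboot --name galera-role --update donor

  # a location rule could then keep the management VIP away from donor/joining nodes
  crm configure location loc_vip_avoid_donor vip__management_old \
    rule -inf: galera-role eq donor or galera-role eq joining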

Revision history for this message
Brad Durrow (l-brad) wrote :

Bogdan, for my Fuel 4.0 deployment it looks like services use the management VIP to connect to MySQL, but :3306 has haproxy listening behind it. I believe it would be more desirable to make MySQL connections result in a TCP RST so that haproxy can fail quickly.

According to the haproxy documentation I found, the option mysql-check works like this:

  ...the check consists of sending two MySQL packet,
  one Client Authentication packet, and one QUIT packet, to correctly close
  MySQL session. We then parse the MySQL Handshake Initialisation packet and/or
  Error packet. It is a basic but useful test which does not produce error nor
  aborted connect on the server.

For reference, my haproxy is configured like this.

I changed the config so that my management IPs are 10.0.5.x, where x is:
.10 management vip
.2 node this config came from
.3 other node
.4 other node

listen mysqld
  bind 10.0.5.10:3306
  balance roundrobin
  mode tcp
  option mysql-check user cluster_watcher
  option tcplog
  option clitcpka
  option srvtcpka
  timeout client 28801s
  timeout server 28801s
  server node-16 10.0.5.2:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3
  server node-17 10.0.5.3:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3 backup
  server node-18 10.0.5.4:3307 check inter 15s fastinter 2s downinter 1s rise 5 fall 3 backup

tags: added: backports-4.1.1
Changed in fuel:
milestone: 4.1.1 → 5.0
tags: removed: backports-4.1.1
Changed in fuel:
importance: High → Medium
Andrew Woodward (xarses)
tags: added: ha
Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
tags: added: to-be-covered-by-tests
Dmitry Ilyin (idv1985)
summary: - while Galera node is in Sync or Donor state many services are down
+ [puppet] while Galera node is in Sync or Donor state many services are down
Dmitry Ilyin (idv1985)
summary: - [puppet] while Galera node is in Sync or Donor state many services are down
+ [library] while Galera node is in Sync or Donor state many services are down
Changed in fuel:
status: Triaged → Fix Committed
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please comment on why you believe this bug has been fixed in 5.1.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

It was implemented in https://review.openstack.org/#/c/106516/ (blueprint: galera-improvements).
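For reference, a common pattern for making haproxy Galera-state-aware (not necessarily the exact change in the review above) is to replace mysql-check with an HTTP check against a clustercheck-style script that returns 503 while the node is a donor or joiner, so haproxy takes it out of rotation. A minimal sketch, reusing the addresses from the config above; the check agent and its port 9200 are assumptions:

  listen mysqld
    bind 10.0.5.10:3306
    mode tcp
    # assumed: a clustercheck-style agent on each node listens on 9200 and
    # answers HTTP 200 only when wsrep_local_state is Synced
    option httpchk
    server node-16 10.0.5.2:3307 check port 9200 inter 15s rise 5 fall 3
    server node-17 10.0.5.3:3307 check port 9200 inter 15s rise 5 fall 3 backup
    server node-18 10.0.5.4:3307 check port 9200 inter 15s rise 5 fall 3 backup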

tags: added: in progress
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #11

"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "auth_required": true, "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-17_21-40-34", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "11", "api": "1.0", "nailgun_sha": "eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d", "production": "docker", "fuelmain_sha": "8ef433e939425eabd1034c0b70e90bdf888b69fd", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"}}}, "fuellib_sha": "d9b16846e54f76c8ebe7764d2b5b8231d6b25079"

I deployed CentOS, HA with 3 controllers.
After that I restarted the second controller and observed the following on the first controller:

[root@node-1 ~]# mysql -e "show status like 'wsrep_local_state_comment';"
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_state_comment | Donor/Desynced |
+---------------------------+----------------+

Immediately:

[root@node-1 ~]# nova service-list
+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
| Binary           | Host              | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+------------------+-------------------+----------+---------+-------+----------------------------+-----------------+
| nova-consoleauth | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-scheduler   | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-conductor   | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:40.000000 | -               |
| nova-cert        | node-1.domain.tld | internal | enabled | up    | 2014-09-29T13:23:36.000000 | -               |
| nova-consoleauth | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:38.000000 | -               |
| nova-scheduler   | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:38.000000 | -               |
| nova-conductor   | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:39.000000 | -               |
| nova-consoleauth | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| nova-scheduler   | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| nova-conductor   | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:10.000000 | -               |
| nova-cert        | node-3.domain.tld | internal | enabled | up    | 2014-09-29T13:23:41.000000 | -               |
| nova-cert        | node-2.domain.tld | internal | enabled | down  | 2014-09-29T13:22:12.000000 | -               |
| n...


Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: in progress
tags: added: in progress