Cluster without quorum after rebooting

Bug #1440723 reported by Alexander Nevenchannyy
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

[root@fuel ~]# fuel --fuel-version
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
api: '1.0'
astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
auth_required: true
build_id: 2015-03-26_21-32-43
build_number: '233'
feature_groups:
- mirantis
fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
production: docker
python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 3f1ece0318e5e93eaf48802fefabf512ca1dce40
      build_id: 2015-03-26_21-32-43
      build_number: '233'
      feature_groups:
      - mirantis
      fuellib_sha: 9c7716bc2ce6075065d7d9dcf96f4c94662c0b56
      fuelmain_sha: 320b5f46fc1b2798f9e86ed7df51d3bda1686c10
      nailgun_sha: b163f6fc77d6639aaffd9dd992e1ad96951c3bbf
      ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
      production: docker
      python-fuelclient_sha: e5e8389d8d481561a4d7107a99daae07c6ec5177
      release: '6.1'

Deployment configuration: Ubuntu with HA, Ceph, 200 bare-metal nodes

[root@fuel ~]# fuel node | grep controller
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
198 | ready | Untitled (07:e2) | 1 | 10.20.1.198 | 0c:c4:7a:1f:07:e2 | controller | | True | 1
200 | ready | Untitled (07:ce) | 1 | 10.20.1.200 | 0c:c4:7a:1f:07:ce | controller | | True | 1
199 | ready | Untitled (a9:34) | 1 | 10.20.1.199 | 0c:c4:7a:1e:a9:34 | controller | | True | 1

root@node-198:~# crm status
Last updated: Mon Apr 6 12:28:43 2015
Last change: Mon Apr 6 09:35:31 2015 via crm_attribute on node-198.domain.tld
Stack: corosync
Current DC: node-198.domain.tld (1) - partition WITHOUT quorum
Version: 1.1.10-42f2063
3 Nodes configured
37 Resources configured

Online: [ node-198.domain.tld ]
OFFLINE: [ node-199.domain.tld node-200.domain.tld ]

root@node-199:~# crm status
Last updated: Mon Apr 6 12:29:17 2015
Last change: Mon Apr 6 10:04:49 2015 via crm_attribute on node-199.domain.tld
Stack: corosync
Current DC: node-198.domain.tld (1) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
37 Resources configured

Node node-198.domain.tld (1): maintenance
Online: [ node-199.domain.tld node-200.domain.tld ]

 vip__public_vrouter (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld (unmanaged)
 vip__management_vrouter (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld (unmanaged)
 vip__public (ocf::fuel:ns_IPaddr2): Started node-200.domain.tld
 Clone Set: clone_ping_vip__public [ping_vip__public]
     ping_vip__public (ocf::pacemaker:ping): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld (unmanaged)
 Clone Set: clone_p_haproxy [p_haproxy]
     p_haproxy (ocf::fuel:ns_haproxy): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_dns [p_dns]
     p_dns (ocf::fuel:ns_dns): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_ntp [p_ntp]
     p_ntp (ocf::fuel:ns_ntp): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_mysql [p_mysql]
     p_mysql (ocf::fuel:mysql-wss): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-200.domain.tld ]
     Stopped: [ node-198.domain.tld node-199.domain.tld ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     p_neutron-plugin-openvswitch-agent (ocf::fuel:ocf-neutron-ovs-agent): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     p_neutron-dhcp-agent (ocf::fuel:ocf-neutron-dhcp-agent): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     p_neutron-metadata-agent (ocf::fuel:ocf-neutron-metadata-agent): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     p_neutron-l3-agent (ocf::fuel:ocf-neutron-l3-agent): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     p_heat-engine (ocf::fuel:heat-engine): Started node-198.domain.tld (unmanaged)
     Started: [ node-199.domain.tld node-200.domain.tld ]

Failed actions:
    p_mysql_monitor_120000 (node=node-199.domain.tld, call=224, rc=7, status=complete, last-rc-change=Sat Apr 4 09:20:32 2015, queued=0ms, exec=0ms): not running

root@node-200:~# crm status
Last updated: Mon Apr 6 12:29:58 2015
Last change: Mon Apr 6 10:00:18 2015 via crm_resource on node-200.domain.tld
Stack: corosync
Current DC: node-198.domain.tld (1) - partition with quorum
Version: 1.1.10-42f2063
3 Nodes configured
37 Resources configured

Online: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]

 vip__public_vrouter (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld
 vip__management_vrouter (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld
 vip__public (ocf::fuel:ns_IPaddr2): Started node-200.domain.tld
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-198.domain.tld
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-200.domain.tld ]
     Stopped: [ node-198.domain.tld node-199.domain.tld ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-198.domain.tld node-199.domain.tld node-200.domain.tld ]

Failed actions:
    p_mysql_monitor_120000 (node=node-199.domain.tld, call=224, rc=7, status=complete, last-rc-change=Sat Apr 4 09:20:32 2015, queued=0ms, exec=0ms): not running
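
The three views above disagree: node-198 sees itself as a single-node partition without quorum, while node-199 and node-200 still report a quorate partition with node-198 online (in maintenance). A minimal diagnostic sketch for cross-checking pacemaker's view against corosync's own membership, assuming the standard corosync 2.x command-line tools shipped with Fuel 6.1:

root@node-198:~# corosync-quorumtool -s              # votequorum view: expected votes, total votes, quorate flag
root@node-198:~# corosync-cfgtool -s                 # ring 0 status and any faulty-ring markers
root@node-198:~# corosync-cmapctl | grep members     # membership entries as seen by this node
root@node-198:~# crm_mon -1                          # one-shot pacemaker status for comparison

Running the same commands on node-199 and node-200 should show whether their membership lists still include node-198 while node-198's own list contains only itself, i.e. whether the split is in totem membership rather than in pacemaker.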

Tags: ha scale
Alexander Nevenchannyy (anevenchannyy) wrote:

Only node-198 was rebooted

Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Alexander Evseev (aevseev) wrote:

More info: all three controllers use the same config, differing only in "bindnetaddr:", and all use the same "mcastport: 5405". On node-198, tcpdump shows outgoing packets to the two other controllers:

root@node-198:~# tcpdump -i br-mgmt -nn -p -s0 udp and port 5405
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-mgmt, link-type EN10MB (Ethernet), capture size 65535 bytes
12:47:46.639418 IP 192.168.0.201.55764 > 192.168.0.202.5405: UDP, length 87
12:47:46.639439 IP 192.168.0.201.53805 > 192.168.0.203.5405: UDP, length 87
12:47:46.839517 IP 192.168.0.201.55764 > 192.168.0.202.5405: UDP, length 87
12:47:46.839536 IP 192.168.0.201.53805 > 192.168.0.203.5405: UDP, length 87

On the two other controllers, the incoming packets can be seen:

root@node-199:~# tcpdump -i br-mgmt -nn -p -s0 udp and port 5405
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-mgmt, link-type EN10MB (Ethernet), capture size 65535 bytes
12:48:27.515548 IP 192.168.0.201.55764 > 192.168.0.202.5405: UDP, length 87
12:48:27.716039 IP 192.168.0.201.55764 > 192.168.0.202.5405: UDP, length 87

and

root@node-200:~# tcpdump -i br-mgmt -nn -p -s0 udp and port 5405
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-mgmt, link-type EN10MB (Ethernet), capture size 65535 bytes
12:48:29.435332 IP 192.168.0.201.53805 > 192.168.0.203.5405: UDP, length 87
12:48:29.660374 IP 192.168.0.201.53805 > 192.168.0.203.5405: UDP, length 87
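
The captures above only cover the 192.168.0.201 → .202/.203 direction. A hedged sketch for checking the reverse path, i.e. whether node-199/node-200 send corosync traffic back towards node-198 and whether it arrives or is dropped (same br-mgmt interface and port assumed):

root@node-199:~# tcpdump -i br-mgmt -nn -p -s0 udp and dst host 192.168.0.201 and port 5405   # replies leaving node-199 towards node-198, if any
root@node-198:~# tcpdump -i br-mgmt -nn -p -s0 udp and src host 192.168.0.202 and port 5405   # the same packets arriving on node-198
root@node-198:~# iptables -L INPUT -v -n | grep 5405                                          # firewall rules/counters that could explain drops

If packets leave node-199 but never show up on node-198, the loss is somewhere on the network path (e.g. bonding or switching), not in corosync itself.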

Config (bindnetaddr is 192.168.0.201, 192.168.0.202, or 192.168.0.203, depending on the node):

# cat /etc/corosync/corosync.conf

compatibility: whitetank

quorum {
  provider: corosync_votequorum

     two_node: 0

}

nodelist {

  node {
    ring0_addr: 192.168.0.201
    nodeid: 1
  }

  node {
    ring0_addr: 192.168.0.202
    nodeid: 2
  }

  node {
    ring0_addr: 192.168.0.203
    nodeid: 3
  }

}

totem {
  version: 2
  token: 3000
  token_retransmits_before_loss_const: 10
  join: 60
  consensus: 3600
  vsftype: none
  max_messages: 20
  clear_node_high_bit: yes
  rrp_mode: none
  secauth: off
  threads: 40
  transport: udpu
  interface {
    member {
      memberaddr: 192.168.0.201
    }
    member {
      memberaddr: 192.168.0.202
    }
    member {
      memberaddr: 192.168.0.203
    }
    ringnumber: 0
    bindnetaddr: 192.168.0.201
    mcastport: 5405
  }
}

logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  logfile: /var/log/corosync.log
  to_syslog: yes
  syslog_facility: daemon
  syslog_priority: info
  debug: off
  function_name: on
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
    tags: enter|leave|trace1|trace2|trace3|trace4|trace6
  }
}

amf {
  mode: disabled
}

aisexec {
  user: root
  group: root
}
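
For reference, with three votequorum nodes and two_node: 0 the quorum threshold is floor(3/2) + 1 = 2 votes, so a lone node-198 being reported as non-quorate is expected; the real defect is that node-198 never rejoins the other two after its reboot. A typical recovery sketch on the affected node (assuming the Ubuntu 14.04 init scripts used by Fuel 6.1; this only works around the symptom and does not address the root cause):

root@node-198:~# service pacemaker stop      # stop the resource manager on this node first
root@node-198:~# service corosync restart    # let corosync re-form totem membership with node-199/node-200
root@node-198:~# service pacemaker start     # bring pacemaker back on top of the rejoined membership
root@node-198:~# crm status                  # expect "partition with quorum" and all three nodes online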

Bogdan Dobrelya (bogdando) wrote:

Please check if this bug is duplicated by https://bugs.launchpad.net/fuel/+bug/1441435

Bogdan Dobrelya (bogdando) wrote:

I set this bug as a duplicate of https://bugs.launchpad.net/fuel/+bug/1441435.
If this is wrong and there was no LACP involved in this issue, please unlink it.
