Cluster upgrade from 9.0 to 9.1 broke corosync cluster

Bug #1641140 reported by Denis Klepikov
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Won't Fix  High        MOS Maintenance
Mitaka              Won't Fix  High        MOS Maintenance
Newton              Won't Fix  High        MOS Maintenance

Bug Description

Cluster upgrade from 9.0 to 9.1 broke the pacemaker/corosync cluster

Steps to reproduce:

1. Create a 9.0 cluster (3 controllers, 3 compute+ceph-osd).
2. Change DNS_DOMAIN and DNS_SEARCH in /etc/fuel/astute.yaml.
3. Run /etc/puppet/modules/fuel/examples/deploy.sh.
4. Deploy the changes to the environment (this succeeded).
5. Check the environment after step 4 (everything worked).
6. Upgrade to Fuel 9.1 using https://docs.mirantis.com/openstack/fuel/fuel-9.1/release-notes/update-product.html

After step 11: 'fuel2 update --env <ENV_ID> install'
the deployment is in an error state.
Pacemaker shows most services in a stopped state.

# pcs status
Cluster name:
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Fri Nov 11 15:27:38 2016 Last change: Thu Nov 10 16:03:59 2016 by root via crm_resource on node-3.domain.local
Stack: corosync
Current DC: node-1.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.domain.local node-2.domain.local node-3.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-2.domain.local node-3.domain.local ]
     Stopped: [ node-1.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Stopped
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Stopped
 Clone Set: clone_p_haproxy [p_haproxy]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_p_mysqld [p_mysqld]
 sysinfo_node-2.domain.local (ocf::pacemaker:SysInfo): Stopped
 sysinfo_node-3.domain.local (ocf::pacemaker:SysInfo): Stopped
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 sysinfo_node-1.domain.local (ocf::pacemaker:SysInfo): Stopped
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]
 Clone Set: clone_p_ntp [p_ntp]
     Stopped: [ node-1.domain.local node-2.domain.local node-3.domain.local ]

PCSD Status:
  node-1.domain.local member (10.220.2.7): Offline
  node-2.domain.local member (10.220.2.5): Offline
  node-3.domain.local member (10.220.2.8): Offline

# crm
crm(live)# status
Last updated: Fri Nov 11 15:28:03 2016 Last change: Thu Nov 10 16:03:59 2016 by root via crm_resource on node-3.domain.local
Stack: corosync
Current DC: node-1.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.domain.local node-2.domain.local node-3.domain.local ]

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-2.domain.local node-3.domain.local ]
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-1.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-1.domain.local
crm(live)#

# crm configure show | grep location
location clone_p_vrouter-on-node-2.domain.local clone_p_vrouter 100: node-2.domain.local
location clone_p_vrouter-on-node-3.domain.local clone_p_vrouter 100: node-3.domain.local
location loc_ping_vip__public vip__public \
location vip__management-on-node-1.domain.local vip__management 100: node-1.domain.local
location vip__management-on-node-2.domain.local vip__management 100: node-2.domain.local
location vip__management-on-node-3.domain.local vip__management 100: node-3.domain.local
location vip__public-on-node-1.domain.local vip__public 100: node-1.domain.local
location vip__public-on-node-2.domain.local vip__public 100: node-2.domain.local
location vip__public-on-node-3.domain.local vip__public 100: node-3.domain.local
location vip__vrouter-on-node-1.domain.local vip__vrouter 100: node-1.domain.local
location vip__vrouter-on-node-2.domain.local vip__vrouter 100: node-2.domain.local
location vip__vrouter-on-node-3.domain.local vip__vrouter 100: node-3.domain.local
location vip__vrouter_pub-on-node-1.domain.local vip__vrouter_pub 100: node-1.domain.local
location vip__vrouter_pub-on-node-2.domain.local vip__vrouter_pub 100: node-2.domain.local
location vip__vrouter_pub-on-node-3.domain.local vip__vrouter_pub 100: node-3.domain.local
colocation conntrackd-with-pub-vip inf: master_p_conntrackd:Master vip__vrouter_pub
colocation dns-with-vrouter-ns inf: clone_p_dns clone_p_vrouter
colocation ntp-with-vrouter-ns inf: clone_p_ntp clone_p_vrouter
colocation vip__vrouter-with-vip__vrouter_pub inf: vip__vrouter vip__vrouter_pub
colocation vip_management-with-haproxy inf: vip__management clone_p_haproxy
colocation vip_public-with-haproxy inf: vip__public clone_p_haproxy

The cluster is non-operational.
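
The warning in the pcs output above suggests a node-name mismatch. A minimal sketch of one way to check for it: extract the node name from each simple `location` constraint in `crm configure show` output and flag any that is not among the nodes the cluster reports as online. The function names and the sample host names in the usage note are hypothetical and not part of Fuel or crmsh; multi-line rule-based constraints (such as the truncated `loc_ping_vip__public` line) are deliberately skipped.

```python
import re

# Matches simple crmsh location constraints of the form:
#   location <id> <resource> <score>: <node>
# Rule-based or line-continued constraints do not match and are ignored.
LOCATION_RE = re.compile(r"^location\s+(\S+)\s+(\S+)\s+(-?\w+):\s+(\S+)\s*$")


def constraint_nodes(crm_config_text):
    """Return {constraint_id: node_name} for simple location constraints."""
    nodes = {}
    for line in crm_config_text.splitlines():
        match = LOCATION_RE.match(line.strip())
        if match:
            constraint_id, _resource, _score, node = match.groups()
            nodes[constraint_id] = node
    return nodes


def stale_constraints(crm_config_text, online_nodes):
    """List ids of constraints that name a node the cluster does not report."""
    known = set(online_nodes)
    return sorted(
        cid
        for cid, node in constraint_nodes(crm_config_text).items()
        if node not in known
    )
```

For example, feeding it constraints that still name `node-1.old.local` together with an online list containing only `node-1.new.local` (both names invented for illustration) would report that constraint id as stale.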

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Marking as Incomplete. Please attach diagnostic snapshot.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Denis Klepikov (dklepikov) wrote :
Sergii Rizvan (srizvan)
Changed in fuel:
importance: Undecided → High
milestone: none → 11.0
status: Incomplete → Confirmed
assignee: nobody → Sergii Rizvan (srizvan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/403655

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Sergii Rizvan (srizvan) wrote :

The root cause of the issue is as follows. After a cluster is redeployed, nailgun changes the FQDN of the nodes in the network_metadata section of the cluster's astute.yaml (/etc/fuel/cluster/{cluster_id}/astute.yaml). Puppet then fetches the new FQDN and tries to apply it to the Pacemaker cluster, but this operation fails. We observed the following in the puppet logs on the nodes: http://paste.openstack.org/show/590647/
That is why we decided not to change the FQDN of nodes in an already deployed cluster when it is redeployed.
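
The idea of not changing the FQDN on redeployment can be sketched roughly as below. This is not the code from the proposed reviews; the node dictionary layout, the `ready` status value, and both function names are assumptions made purely for illustration.

```python
def build_fqdn(hostname, dns_domain):
    """Derive an FQDN from a short hostname and the global DNS domain."""
    return "{0}.{1}".format(hostname, dns_domain)


def node_fqdn(node, dns_domain):
    """Pick the FQDN to serialize for a node.

    `node` is a hypothetical dict with 'hostname', 'status' and an optional
    stored 'fqdn'. A node that is already deployed keeps the FQDN it was
    deployed with; only other nodes pick up the current DNS domain.
    """
    if node.get("status") == "ready" and node.get("fqdn"):
        return node["fqdn"]
    return build_fqdn(node["hostname"], dns_domain)
```

With this shape, changing the DNS domain after deployment would only affect nodes that are added later, so Pacemaker would not be asked to rename existing members.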

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/409111

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Sergii Rizvan (<email address hidden>) on branch: master
Review: https://review.openstack.org/403655
Reason: Abandoned in favor of https://review.openstack.org/#/c/409111/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/410215

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/410221

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Does not affect 9.2 and update procedure - retargeted to 9.3

tags: added: move-to-9.3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/mitaka)

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/410221
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/newton)

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/410215
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/409111
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Sergii Rizvan (srizvan)
Changed in fuel:
assignee: Sergii Rizvan (srizvan) → MOS Maintenance (mos-maintenance)
Changed in fuel:
status: In Progress → Won't Fix
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

I'm closing this as Won't Fix for all series for the following reasons:

 * The issue itself is a very specific one; it is not connected with updates or pacemaker.
   It happens because DNS_DOMAIN is a global setting that is used each time one deploys
   or updates an environment. If DNS_DOMAIN was changed prior to an update, it will cause
   failures everywhere. This is, unfortunately, expected.
 * The proper fix changes the structure of the database and is therefore not suitable for updates.
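
Because DNS_DOMAIN applies globally, a pre-update check along the following lines could reveal whether the current value would rename already deployed nodes. It is only a sketch: it assumes the network_metadata section has been loaded into a dict whose `nodes` entries carry `name` and `fqdn` keys, and the function name is hypothetical.

```python
def fqdn_drift(network_metadata, dns_domain):
    """Return {node_key: (recorded_fqdn, expected_fqdn)} for mismatched nodes.

    `network_metadata` is assumed to look like
    {"nodes": {"node-1": {"name": "node-1", "fqdn": "node-1.domain.local"}}}.
    """
    drift = {}
    for key, node in network_metadata.get("nodes", {}).items():
        expected = "{0}.{1}".format(node["name"], dns_domain)
        if node["fqdn"] != expected:
            drift[key] = (node["fqdn"], expected)
    return drift
```

An empty result means the recorded FQDNs already agree with the current domain; any entry points at a node whose name would change on the next deployment or update.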

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: master
Review: https://review.opendev.org/409111
Reason: This repo is retired now, no further work will get merged.
