Pacemaker being required for non-HA deployment

Bug #1973460 reported by Cristian Le
Affects: tripleo | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

I have been trying to deploy the simplest TripleO stack for weeks, and the latest issue is that the deploy is stuck in a loop performing this command on the controller node:
```
Debug: Exec[wait-for-settle](provider=posix): Executing check '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless: Error: error running crm_mon, is pacemaker running?
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless: crm_mon: Error: cluster is not available on this node
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Exec try 1/360
Debug: Exec[wait-for-settle](provider=posix): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
Debug: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Sleeping for 10.0 seconds between tries
```
But I am not trying to deploy an HA cluster, and even if I include `docker-ha.yaml`, this issue persists. For reference, this is the answers-file I tried to use:
```
templates: /usr/share/openstack-tripleo-heat-templates/
environments:
  - /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-environment.yaml
  - ./overcloud-baremetal-deployed.yaml
  - ./overcloud-networks-deployed.yaml
  - ./overcloud-vips-deployed.yaml
  - ./containers-prepare-parameter.yaml
```
And this is the baremetal_deployment.yaml:
```
- name: Controller
  count: 1
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: internalapi
    - network: storagemgmt
    - network: storage
    network_config:
      template: templates/single_nic_vlans/single_nic_vlans.j2
      default_route_network:
      - ctlplane
  instances:
  - hostname: controller.arda
    name: controller
- name: Compute
  count: 1
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: internalapi
    - network: storage
    network_config:
      template: templates/single_nic_vlans/single_nic_vlans.j2
      default_route_network:
      - ctlplane
  instances:
  - hostname: compute.arda
    name: compute
```
I do not see anything special here; it is a very basic example according to the deployment guide, yet I cannot get it to work. This is running on the master branch with CentOS 9 Stream. I am attaching a more detailed archive of this setup.

Revision history for this message
Cristian Le (lecris) wrote :
Revision history for this message
Brendan Shephard (bshephar) wrote :

Looks like it failed when Puppet tried to configure pacemaker:
416460 2022-05-15 16:10:08,766 p=453290 u=stack n=ansible | 2022-05-15 16:10:08.766562 | 002590f0-0f45-829a-6014-000000002ebe | FATAL | Wait for puppet host configuration to finish | controller.arda | error={

2022-05-15 13:24:12 +0000 Exec[wait-for-settle](provider=posix) (debug): Executing check '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:12 +0000 Puppet (debug): Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:12 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless (debug): Error: error running crm_mon, is pacemaker running?
2022-05-15 13:24:12 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/unless (debug): crm_mon: Error: cluster is not available on this node
2022-05-15 13:24:12 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns (debug): Exec try 1/360
2022-05-15 13:24:12 +0000 Exec[wait-for-settle](provider=posix) (debug): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:12 +0000 Puppet (debug): Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:13 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns (debug): Sleeping for 10.0 seconds between tries
2022-05-15 13:24:23 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns (debug): Exec try 2/360
2022-05-15 13:24:23 +0000 Exec[wait-for-settle](provider=posix) (debug): Executing '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:23 +0000 Puppet (debug): Executing: '/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1'
2022-05-15 13:24:23 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns (debug): Sleeping for 10.0 seconds between tries
2022-05-15 13:24:28 +0000 /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns (err): change from 'notrun' to ['0'] failed: exit

I assume the pacemaker service is failing to start?

Was this node used for a previous deployment and not cleaned? I.e., could this node already be configured from a previous deployment with a hacluster username and password?

Let's check what is preventing pcsd from starting on the Controller:
systemctl status pcsd
journalctl -u pcsd -e

Revision history for this message
Cristian Le (lecris) wrote :

I will update with the specifics in a bit. But yes, pcsd is failing to start, and no, all of the deployments are starting from scratch (I have yet to successfully deploy a cluster).

The services `corosync` and `pacemaker` are enabled but not started: the former lacks a config file, and the latter complains that it is missing dependencies. I think `pacemaker` and `pcsd` are two separate services, and I don't remember what fails in the latter.

Revision history for this message
Cristian Le (lecris) wrote (last edit ):

The `pcsd` service is booting up OK. It seems to be complaining about the `pacemaker` service instead, though. I have attached the logs reproduced on the controller node.

Also, everything is running on baremetal. Versions are `pacemaker 2.1.2-4.el9` and `pcs 0.11.1`.

@bshephar Can you forward/ping/assign to the relevant pacemaker team?

Revision history for this message
Brendan Shephard (bshephar) wrote :

pcsd seems to start fine:

❯ cat pcsd.log
I, [2022-05-16T04:43:15.041 #00000] INFO -- : Starting server...
I, [2022-05-16T04:43:15.045 #00000] INFO -- : Binding socket for address '192.168.24.43' and port '2224'
I, [2022-05-16T04:43:15.046 #00000] INFO -- : Server is listening
I, [2022-05-16T04:43:15.047 #00000] INFO -- : Notifying systemd we are running (socket '/run/systemd/notify')
I, [2022-05-16T04:43:15.068 #00001] INFO -- : Config files sync started
I, [2022-05-16T04:43:15.068 #00001] INFO -- : Config files sync skipped, this host does not seem to be in a cluster of at least 2 nodes
I, [2022-05-16T04:43:17.575 #00000] INFO -- : Attempting login by 'hacluster'
I, [2022-05-16T04:43:17.651 #00000] INFO -- : Successful login by 'hacluster'
I, [2022-05-16T04:43:17.666 #00002] INFO -- : Empty file '/var/lib/pcsd/pcs_users.conf', creating new file
I, [2022-05-16T04:43:17.666 #00000] INFO -- : 200 POST /remote/auth (192.168.24.43) 100.13ms

Corosync seems to be failing because of a missing conf file:
❯ cat controller-corosync.log
May 16 04:58:14 controller.arda systemd[1]: Starting Corosync Cluster Engine...
May 16 04:58:14 controller.arda corosync[21245]: Can't read file /etc/corosync/corosync.conf: No such file or directory
May 16 04:58:14 controller.arda systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
May 16 04:58:14 controller.arda systemd[1]: corosync.service: Failed with result 'exit-code'.
May 16 04:58:14 controller.arda systemd[1]: Failed to start Corosync Cluster Engine.

I noticed that you have a custom network_data.yaml file, but you're not including it in your deploy command. And there are some errors related to network configuration on the node:

❯ cat network_data.yaml
- name: InternalApi
  dns_domain: internalapi.arda.
  vip: true
  subnets:
    internal_api_subnet01:
      ip_subnet: 192.168.2.0/24
      allocation_pools:
        - start: 192.168.2.10
          end: 192.168.2.250
      vlan: 20
- name: StorageMgmt
  dns_domain: storagemgmt.arada.
  vip: true
  subnets:
    storage_mgmt_subnet01:
      ip_subnet: 192.168.3.0/24
      allocation_pools:
        - start: 192.168.3.10
          end: 192.168.3.250
      vlan: 30
- name: Storage
  dns_domain: storage.arda.
  vip: true
  subnets:
    storage_subnet01:
      ip_subnet: 10.0.3.0/24
      allocation_pools:
        - start: 10.0.3.10
          end: 10.0.3.250
      vlan: 130

❯ cat Step6_Deploy_overcloud.sh
#!/bin/bash
source stackrc
openstack overcloud deploy \
        --stack overcloud \
        --answers-file overcloud-deploy-answers.yaml

This would need a -n /home/stack/network_data.yaml if you wanted to have your custom networks configured. Example:

#!/bin/bash
source stackrc
openstack overcloud deploy \
        --stack overcloud \
        -n /home/stack/network_data.yaml \
        --answers-file overcloud-deploy-answers.yaml

❯ grep Error var/log/os-net-config.log -A4
2022-05-16 00:28:54.509 WARNING os_net_config.impl_ifcfg.apply Error in 'ip route add default via 192.168.24.1 dev br-ex', restarting br-ex:
Unexpected error while running command.
Command: /sbin/ip route add default via 192.168...


Revision history for this message
Brendan Shephard (bshephar) wrote :

Looks like it did configure your custom networks after all; I can see evidence of one of them here:
# This file is autogenerated by os-net-config
DEVICE=vlan130
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSIntPort
OVS_BRIDGE=br-ex
OVS_OPTIONS="tag=130"
MTU=1500
BOOTPROTO=static
IPADDR=10.0.3.12
NETMASK=255.255.255.0

I guess you fixed that already.

Revision history for this message
Cristian Le (lecris) wrote (last edit ):

@bshephar Thank you for the advice. I didn't know that we still need to add `--networks-file` even after we specify the networks in `overcloud node provision`. The networks seemed to have been working without it lately, but I will try adding it from now on.

Otherwise, were you able to deploy with these instructions of mine and have everything working on your side? Are there any repository or container differences? I was using the latest `tripleo-repos current` and the image from https://images.rdoproject.org/centos9/master/rdo_trunk/current-tripleo/

Edit: It seems that in my latest attempt I did add `--networks-file`, and the same issue occurred. Also, I don't know if it's intended, but all ports except ctlplane are down when I query from the undercloud. Maybe because the undercloud is not on those networks?

Revision history for this message
Cristian Le (lecris) wrote :

Wait, why does my `pcsd` say that it is starting a GUI?
```
May 16 12:03:50 controller.arda systemd[1]: Starting PCS GUI and remote configuration interface...
May 16 12:03:51 controller.arda systemd[1]: Started PCS GUI and remote configuration interface.
```

Revision history for this message
Rabi Mishra (rabi) wrote :

The issue is that your network_data.yaml does not have the 'name_lower' key for the networks.

- name: InternalApi
  dns_domain: internalapi.arda.
  vip: true
  subnets:
    internal_api_subnet01:
      ip_subnet: 192.168.2.0/24
      allocation_pools:
        - start: 192.168.2.10
          end: 192.168.2.250
      vlan: 20

Therefore most of the service networks default to ctlplane, as the default logic does not work [1]. Either add the 'name_lower' key to network_data.yaml, or override the ServiceNetMap parameter for the networks to use 'internalapi' [2] (rather than the default 'internal_api'). The specific pacemaker issue you're facing is due to HostnameResolveNetwork falling back to ctlplane.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-resource-registry-puppet.j2.yaml#L365

[2] servicenet_map_env.yaml
parameter_defaults:
  ServiceNetMap:
    ApacheNetwork: internalapi
    NeutronTenantNetwork: internalapi
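
For the first option, adding 'name_lower' would look something like this (a minimal sketch of the InternalApi network only, assuming the default ServiceNetMap entries expect 'internal_api', 'storage_mgmt' and 'storage'):

- name: InternalApi
  name_lower: internal_api   # likewise StorageMgmt -> storage_mgmt, Storage -> storage
  dns_domain: internalapi.arda.
  vip: true
  # subnets stay the same as in the original network_data.yaml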

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/847758

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/847758
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/da1e3e250313f80da47c3241f1c05cb15c123ed7
Submitter: "Zuul (22348)"
Branch: master

commit da1e3e250313f80da47c3241f1c05cb15c123ed7
Author: rabi <email address hidden>
Date: Mon Jun 27 14:41:46 2022 +0530

    Simplify HostnameResolveNetwork in ServiceNetMap

    We're using hardcoded network names which may not be the case
    when using custom networks and it's overly complex atm.

    Related-Bug: #1973460
    Change-Id: I80ef75a5003e2ad8f42473df2f7bbfaffc8320b3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/848059

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/848059
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/d8696d8b79046a3df365b3a8064735a6eba5b199
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit d8696d8b79046a3df365b3a8064735a6eba5b199
Author: rabi <email address hidden>
Date: Mon Jun 27 14:41:46 2022 +0530

    Simplify HostnameResolveNetwork in ServiceNetMap

    We're using hardcoded network names which may not be the case
    when using custom networks and it's overly complex atm.

    Related-Bug: #1973460
    Change-Id: I80ef75a5003e2ad8f42473df2f7bbfaffc8320b3
    (cherry picked from commit da1e3e250313f80da47c3241f1c05cb15c123ed7)

tags: added: in-stable-wallaby
Revision history for this message
Cristian Le (lecris) wrote (last edit ):

@rabi I should have updated this earlier, but we pinned the original issue down to `baremetal_deployment.yaml`: `instances.hostname` must not contain a `.`, i.e. `compute.stack` -> `compute`.
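For example, the corrected instances block from the baremetal_deployment.yaml above would look like this (a minimal sketch, only the changed keys shown):
```
  instances:
  - hostname: controller   # previously controller.arda; no '.' allowed here
    name: controller
```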

I agree revising how the networks are parsed is also important here. How does this work in conjunction with `service_net_map_replace` and deploying multiple overclouds from one undercloud?
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/multiple_overclouds.html
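For example, if a network is renamed for a second overcloud, is something like this (adapted from that guide, so the exact keys may differ) still expected to resolve the default ServiceNetMap keys correctly?
```
- name: InternalApi
  name_lower: internal_api_cloud_2        # custom lower name for the second overcloud
  service_net_map_replace: internal_api   # set to the original default so ServiceNetMap still resolves
```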

Also, some of the documentation should be updated to reflect the appropriate naming conventions:
https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html#baremetal-provision-configuration
There might be other locations from which I adapted the original example I provided.

no longer affects: puppet-pacemaker