Getting message "vip not yet configured" on all OpenStack cluster-based services

Bug #1911909 reported by kashif nawaz
This bug affects 4 people
Affects                      Status    Importance  Assigned to  Milestone
Canonical Juju               Triaged   High        Unassigned
OpenStack HA Cluster Charm   Invalid   Undecided   Unassigned

Bug Description

I am trying to deploy a charm-based OpenStack and Contrail cluster, but each time the deployment gets stuck with "vip not yet configured" on all of the OpenStack clustered services.

  glance-hacluster/0* blocked idle 172.30.204.152 Resource: res_glance_2a7ea9f_vip not running
glance/1 active idle 1/lxd/2 172.30.204.169 9292/tcp Unit is ready
  glance-hacluster/1 waiting idle 172.30.204.169 Resource: res_glance_2a7ea9f_vip not yet configured
glance/2 active idle 2/lxd/2 172.30.204.157 9292/tcp Unit is ready
  glance-hacluster/2 waiting idle 172.30.204.157 Resource: res_glance_2a7ea9f_vip not yet configured
heat/0* active idle 0/kvm/3 172.30.204.149 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/3 active idle 172.30.204.149 Unit is ready
  heat-hacluster/0* blocked idle 172.30.204.149 Resource: res_heat_411abe0_vip not running
  ntp/6 active idle 172.30.204.149 123/udp chrony: Ready
heat/1 active idle 1/kvm/3 172.30.204.172 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/7 active idle 172.30.204.172 Unit is ready
  heat-hacluster/2 waiting idle 172.30.204.172 Resource: res_heat_411abe0_vip not yet configured
  ntp/12 active idle 172.30.204.172 123/udp chrony: Ready
heat/2 active idle 2/kvm/3 172.30.204.164 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/6 active idle 172.30.204.164 Unit is ready
  heat-hacluster/1 waiting idle 172.30.204.164 Resource: res_heat_411abe0_vip not yet configured
  ntp/11 active idle 172.30.204.164 123/udp chrony: Ready
keystone/0* active idle 0/lxd/4 172.30.204.155 5000/tcp Unit is ready
  keystone-hacluster/0* blocked idle 172.30.204.155 Resource: res_ks_1daef6e_vip not running
keystone/1 active idle 1/lxd/3 172.30.204.165 5000/tcp Unit is ready
  keystone-hacluster/1 waiting idle 172.30.204.165 Resource: res_ks_1daef6e_vip not yet configured
keystone/2 active idle 2/lxd/3 172.30.204.160 5000/tcp Unit is ready
  keystone-hacluster/2 waiting idle 172.30.204.160 Resource: res_ks_1daef6e_vip not yet configured
memcached/0* active idle 0/lxd/5 172.30.205.178 11211/tcp Unit is ready and clustered
memcached/1 active idle 1/lxd/4 172.30.205.163 11211/tcp Unit is ready and clustered
memcached/2 active idle 2/lxd/4 172.30.205.206 11211/tcp Unit is ready and clustered
mysql/0* active idle 0/lxd/6 172.30.205.197 3306/tcp Unit is ready
  mysql-hacluster/0* active idle 172.30.205.197 Unit is ready and clustered
mysql/1 active idle 1/lxd/5 172.30.205.216 3306/tcp Unit is ready
  mysql-hacluster/1 active idle 172.30.205.216 Unit is ready and clustered
mysql/2 active idle 2/lxd/5 172.30.205.169 3306/tcp Unit is ready
  mysql-hacluster/2 active idle 172.30.205.169 Unit is ready and clustered
neutron-api/0* active idle 0/kvm/4 172.30.204.153 9696/tcp Unit is ready
  contrail-openstack/2 active idle 172.30.204.153 Unit is ready
  neutron-hacluster/0* blocked idle 172.30.204.153 Resource: res_neutron_945a966_vip not running
  ntp/5 active idle 172.30.204.153 123/udp chrony: Ready
neutron-api/1 active idle 1/kvm/4 172.30.204.171 9696/tcp Unit is ready
  contrail-openstack/5 active idle 172.30.204.171 Unit is ready
  neutron-hacluster/2 waiting idle 172.30.204.171 Resource: res_neutron_945a966_vip not yet configured
  ntp/10 active idle 172.30.204.171 123/udp chrony: Ready
neutron-api/2 active idle 2/kvm/4 172.30.204.162 9696/tcp Unit is ready
  contrail-openstack/4 active idle 172.30.204.162 Unit is ready
  neutron-hacluster/1 waiting idle 172.30.204.162 Resource: res_neutron_945a966_vip not yet configured
  ntp/8 active idle 172.30.204.162 123/udp chrony: Ready
nova-cloud-controller/0* active idle 0/lxd/7 172.30.204.151 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/0* blocked idle 172.30.204.151 Resource: res_nova_0d01fbb_vip not running
nova-cloud-controller/1 active idle 1/lxd/6 172.30.204.167 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/2 waiting idle 172.30.204.167 Resource: res_nova_0d01fbb_vip not yet configured
nova-cloud-controller/2 active idle 2/lxd/6 172.30.204.158 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/1 waiting idle 172.30.204.158 Resource: res_nova_0d01fbb_vip not yet configured
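
For anyone triaging a unit stuck in this state, the pacemaker view can be checked directly on a blocked unit. A minimal triage sketch, reusing the resource and unit names from the status output above:

$ juju ssh glance-hacluster/0 'sudo crm status'             # is res_glance_2a7ea9f_vip defined and started?
$ juju ssh glance-hacluster/0 'sudo crm configure show'     # inspect the rendered resource definitions
$ juju ssh glance-hacluster/0 'sudo corosync-quorumtool -s' # check corosync membership and quorum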

Tags: cdo-qa
kashif nawaz (knawaz) wrote:

series: bionic
variables:
  # https://wiki.ubuntu.com/OpenStack/CloudArchive
  # packages for an LTS release come in a form of SRUs
  # do not use cloud:<pocket> for an LTS version as
  # installation hooks will fail. Example:
  openstack-origin: &openstack-origin distro
  #openstack-origin: &openstack-origin cloud:bionic-rocky

  openstack-region: &openstack-region RegionOne

  # !> Important <!
  # configure this value for the API services; if they spawn
  # too many workers you will get inconsistent failures
  # due to CPU overcommit
  worker-multiplier: &worker-multiplier 0.25

  # Number of MySQL connections in the env. The charm default is not
  # enough for an environment of this size, so the bundle declares
  # 2000. There's hardly a case for going higher than this.
  mysql-connections: &mysql-connections 2000

  # MySQL tuning level. Charm default is "safest", this however
  # impacts performance. For spinning platters consider setting this
  # to "fast"
  mysql-tuning-level: &mysql-tuning-level safest

  # Configure RAM allocation params for nova. For hyperconverged
  # nodes, we need plenty of reserve for service containers,
  # Ceph OSDs, and swift-storage daemons. Those processes not only
  # allocate RAM directly but also indirectly via pagecache, file
  # system caches, and system buffer usage. Adjust appropriately
  # for higher-density clouds, e.g. a high OSD/host ratio or when
  # running >2 service containers per host.
  reserved-host-memory: &reserved-host-memory 16384
  ram-allocation-ratio: &ram-allocation-ratio 0.999999 # XXX bug 1613839
  cpu-allocation-ratio: &cpu-allocation-ratio 4.0

  # This is Management network, unrelated to OpenStack and other applications
  # OAM - Operations, Administration and Maintenance
  oam-space: &oam-space oam-space

  # This is OpenStack Admin network; for adminURL endpoints
  admin-space: &admin-space oam-space

  # This is OpenStack Public network; for publicURL endpoints
  public-space: &public-space external-space

  # This is OpenStack Internal network; for internalURL endpoints
  internal-space: &internal-space oam-space

  # CEPH configuration
  # CEPH access network
  ceph-public-space: &ceph-public-space ceph-access-space

  # CEPH replication network
  ceph-cluster-space: &ceph-cluster-space ceph-replica-space
  sdn-transport: &sdn-transport sdn-transport

  # Workaround for 'only one default binding supported'
  oam-space-constr: &oam-space-constr spaces=oam-space
  ceph-access-constr: &ceph-access-constr spaces=ceph-access-space
  combi-access-constr: &combi-access-constr spaces=ceph-access-space,oam-space

  # Various VIPs
  aodh-vip: &aodh-vip "172.30.204.132 172.30.205.132"
  cinder-vip: &cinder-vip "172.30.204.133 172.30.205.133"
  dashboard-vip: &dashboard-vip "172.30.205.144"
  glance-vip: &glance-vip "172.30.204.134 172.30.205.134"
  gnocchi-vip: &gnocchi-vip "172.30.204.135 172.30.205.135"
  heat-vip: &heat-vip "172.30.204.136 172.30.205.136"
 ...
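
As a quick sanity check that the VIP anchors above reached the deployed applications, the effective charm config can be queried; a sketch using the application names from the status output in the description:

$ juju config glance vip   # expect "172.30.204.134 172.30.205.134"
$ juju config heat vip     # expect "172.30.204.136 172.30.205.136"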

kashif nawaz (knawaz) wrote:

ubuntu@jumphost:~$ maas admin subnets read
Success.
Machine-readable output follows:
[
    {
        "name": "oam",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": true,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5001,
            "space": "oam-space",
            "primary_rack": "4ds8d6",
            "fabric": "fabric-0",
            "fabric_id": 0,
            "resource_uri": "/MAAS/api/2.0/vlans/5001/"
        },
        "cidr": "172.30.205.128/25",
        "rdns_mode": 2,
        "gateway_ip": "172.30.205.129",
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 1,
        "space": "oam-space",
        "resource_uri": "/MAAS/api/2.0/subnets/1/"
    },
    {
        "name": "external",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5003,
            "space": "external-space",
            "primary_rack": null,
            "fabric": "fabric2",
            "fabric_id": 2,
            "resource_uri": "/MAAS/api/2.0/vlans/5003/"
        },
        "cidr": "172.30.204.128/26",
        "rdns_mode": 2,
        "gateway_ip": null,
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 3,
        "space": "external-space",
        "resource_uri": "/MAAS/api/2.0/subnets/3/"
    },
    {
        "name": "overlay",
        "description": "",
        "vlan": {
            "vid": 300,
            "mtu": 9000,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "overlay",
            "secondary_rack": null,
            "id": 5005,
            "space": "sdn-transport",
            "primary_rack": null,
            "fabric": "fabric3",
            "fabric_id": 3,
            "resource_uri": "/MAAS/api/2.0/vlans/5005/"
        },
        "cidr": "192.168.254.0/24",
        "rdns_mode": 2,
        "gateway_ip": null,
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 4,
        "space": "sdn-transport",
        "resource_uri": "/MAAS/api/2.0/subnets/4/"
    },
    {
        "name": "appformix",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5002,
            "space": "appformix-space",
            "primary_rack": null,
            "fabric": "fabric-1",
            "fabric_id": 1,
            "resource_uri": "/MAAS/api/2.0/vlans/5002/"
        },
        "cidr": "172...


Bas de Bruijne (basdbruijne) wrote:

We have similar issues with SQA, except that just one HA-cluster is showing this error:
https://oil-jenkins.canonical.com/job/fce_build/140//console

All the occurrences can be found here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1911909

Under the testrun ID, at the bottom of the page, there is a link to the `full artifacts repository` for the testrun, where crashdumps can be downloaded.

Billy Olsen (billy-olsen) wrote:

Looking through the logs for the glance units from one of the runs identified in comment #3, I have a strong suspicion that this is related to bug https://bugs.launchpad.net/charm-hacluster/+bug/1874719. A patch has been proposed and merged at https://review.opendev.org/c/openstack/charm-hacluster/+/834034 and is available in charm revision 93 in the latest/edge channel.

Can you please try using the latest/edge channel for focal+ deployments?
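
For reference, moving an already-deployed hacluster application to that channel would look roughly like this (a sketch, assuming juju 2.9+, where `juju refresh` supersedes `upgrade-charm`):

$ juju refresh glance-hacluster --channel latest/edge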

Supporting evidence below.

I see the following:

2022-04-03 18:04:03 ERROR unit.hacluster-glance/1.juju-log server.go:327 Pacemaker is down. Please manually start it. Pacemaker or Corosync are still not fully up after waiting for 12 retries. This looks like lp:1874719. Last output: node1(1): member
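
That retry-exhaustion message suggests pacemaker never came up on the unit; a quick check would be (sketch):

$ juju ssh hacluster-glance/1 'systemctl status corosync pacemaker --no-pager'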

With pacemaker showing the following errors in the syslog:

Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Caught 'Terminated' signal
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Shutting down Pacemaker
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Stopping pacemaker-controld
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Caught 'Terminated' signal
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Shutting down cluster resource manager
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-attrd[33245]: notice: Setting shutdown[juju-f975de-0-lxd-4]: (unset) -> 1649009637
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: Resource start-up disabled since no STONITH resources have been defined
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: Either configure some or disable STONITH with the stonith-enabled option
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Delaying fencing operations until there are resources to manage
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Scheduling shutdown of node juju-f975de-0-lxd-4
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: warning: Node node1 is unclean!
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: * Shutdown juju-f975de-0-lxd-4
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-2.bz2
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Configuration errors found during scheduler processing, please run "crm_verify -L" to identify issues
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Transition 2 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Disconnected from the executor
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Disconnected from Corosync
Apr ...


Bas de Bruijne (basdbruijne) wrote:

Unfortunately, we are now also seeing this on latest/edge, e.g. in this testrun:
https://solutions.qa.canonical.com/testruns/testRun/24a134bd-d912-4bfa-9bc8-bf61dbe4fc45

Looking at the logs https://oil-jenkins.canonical.com/artifacts/24a134bd-d912-4bfa-9bc8-bf61dbe4fc45/generated/generated/openstack/juju-crashdump-openstack-2022-04-16-16.29.24.tar.gz on unit 1/lxd/9, nothing stands out regarding the hacluster. I also can't see the messages you refer to in comment #4.

Felipe Reyes (freyes) wrote:

This piece of the pacemaker log seems relevant:

Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Node juju-d527c3-1-lxd-9 state is now member
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Defaulting to uname -n for the local corosync node name
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Pacemaker controller successfully started and accepting connections
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: State transition S_STARTING -> S_PENDING
Apr 16 12:51:37 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Fencer successfully connected
Apr 16 12:51:57 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Apr 16 12:51:57 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: Resource start-up disabled since no STONITH resources have been defined
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: Either configure some or disable STONITH with the stonith-enabled option
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Delaying fencing operations until there are resources to manage
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-2.bz2
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Configuration errors found during scheduler processing, please run "crm_verify -L" to identify issues
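
The scheduler itself names the next diagnostic step; following up on the affected container (1/lxd/9, per comment #5) would look like this sketch:

$ juju ssh 1/lxd/9 'sudo crm_verify -L -V'                          # list the configuration errors
$ juju ssh 1/lxd/9 'sudo crm configure show | grep stonith-enabled' # has STONITH been disabled yet?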

Billy Olsen (billy-olsen) wrote:

Looking through the crashdump, this does not appear to be related to LP#1874719.

The charm is simply reporting that the resource does not exist in the local set of resources reported by pacemaker. It would be ideal to have the contents of /var/log/pacemaker as well as /etc/corosync/corosync.conf so we can see the rendered configuration file.

Some oddities that are present, which *may* explain a faulty rendered configuration file:

hacluster-placement/1 never sees relation-joined hooks from hacluster-placement/2, whereas hacluster-placement/{0,2} both see the relation-joined hooks from all the units.

Unfortunately, we'll need more information to see what's going on here.
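
Gathering the requested files from one of the affected units would look roughly like this (a sketch; the unit name is illustrative):

$ juju ssh hacluster-placement/1 'sudo cat /etc/corosync/corosync.conf'
$ juju ssh hacluster-placement/1 'sudo tar czf /home/ubuntu/pacemaker-logs.tgz /var/log/pacemaker && sudo chown ubuntu: /home/ubuntu/pacemaker-logs.tgz'
$ juju scp hacluster-placement/1:pacemaker-logs.tgz .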

Felipe Reyes (freyes) wrote:

Yeah, I believe this is the issue:

hacluster-placement/1 didn't get to see all the peers joining the hanode relation:

$ grep hacluster-placement machine-lock.log | grep -v update-status | grep 'unit: hacluster'
2022-04-16 12:54:44 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-joined (25; unit: hacluster-placement/0) hook), waited 23s, held 5s
2022-04-16 12:55:14 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-changed (25; unit: hacluster-placement/0) hook), waited 24s, held 6s
2022-04-16 12:56:38 unit-placement-1: placement/1 uniter (run relation-joined (249; unit: hacluster-placement/1) hook), waited 19s, held 5s
2022-04-16 12:57:13 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-changed (25; unit: hacluster-placement/0) hook), waited 12s, held 5s
2022-04-16 12:59:48 unit-placement-1: placement/1 uniter (run relation-changed (249; unit: hacluster-placement/1) hook), waited 9s, held 8s
2022-04-16 13:01:12 unit-nrpe-42: nrpe/42 uniter (run relation-joined (250; unit: hacluster-placement/1) hook), waited 13s, held 11s
2022-04-16 13:01:35 unit-nrpe-42: nrpe/42 uniter (run relation-changed (250; unit: hacluster-placement/1) hook), waited 12s, held 11s

Also this unit wasn't the leader:

2022-04-16 12:46:59 DEBUG juju.worker.uniter.relation statetracker.go:221 unit "hacluster-placement/1" (leader=false) entered scope for relation "hacluster-placement:hanode"

This prevents this piece of code[0] from running:

...
    if configure_corosync():
        try_pcmk_wait()
        if is_leader():
            run_initial_setup() #<---!!
...

The function run_initial_setup() is the one in charge of disabling STONITH[1], and due to the peer-relation gap described above this unit was being configured as a single-node cluster:

$ journalctl --file ../journal/660a5f12c3b64778a026dc895e3d6c09/system.journal | grep 'adding new UDPU member'
abr 16 08:46:15 juju-d527c3-1-lxd-9 corosync[27385]: [TOTEM ] adding new UDPU member {10.246.165.57}
abr 16 08:51:34 juju-d527c3-1-lxd-9 corosync[35698]: [TOTEM ] adding new UDPU member {10.246.165.57}

^ that's the journal file for machine 1/lxd/9; in comparison, this is how the same grep looks for nova-cloud-controller/0:

$ journalctl --file journal/c5b4fcd7022748cda9f41e1f6f6df7b9/system.journal | grep 'adding new UDPU member'
abr 16 08:46:23 juju-d527c3-0-lxd-7 corosync[30678]: [TOTEM ] adding new UDPU member {10.246.164.247}
abr 16 08:51:50 juju-d527c3-0-lxd-7 corosync[39528]: [TOTEM ] adding new UDPU member {10.246.164.247}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.167.82}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.165.194}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.164.247}

So I believe this is not a charm issue but a juju issue. Not necessarily a bug; it could be related to the hooks still being processed in the queue.
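
On a live deployment, that theory can be checked by comparing the relation membership as each peer sees it; a sketch (relation id 25 is taken from the machine-lock.log above):

$ juju run --application hacluster-placement 'relation-ids hanode'
$ juju run --application hacluster-placement 'relation-list -r hanode:25'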

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/hooks.py#L218-L221
[1] https://github.com/openstack/charm-hacluster/blob...


Billy Olsen (billy-olsen) wrote:

Having reviewed Felipe's latest analysis, combined with the previous look, I agree that this appears to be a juju issue. There could be additional hooks left in the queue, but the juju devs should take a look to validate or invalidate that.

Changed in charm-hacluster:
status: New → Invalid
Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

I have done significant work refactoring hacluster to overcome that issue. There are several specific conditions that result in "vip not yet configured". Maybe the issue can be addressed by the following patches:

https://review.opendev.org/c/openstack/charm-hacluster/+/818996
https://review.opendev.org/c/openstack/charm-hacluster/+/815755

The two patches are ideally used together; neither is a hard dependency of the other, but they complement each other.

Liam Young (gnuoy) wrote:

In the deployment I looked at, two VIPs were requested:

juju config aodh vip
10.246.172.210 10.246.168.210

aodh translated this into a request for two different resources (from the ha relation data):

"res_aodh_8486566_vip": " params ip=\"10.246.168.210\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\"",
"res_aodh_eth0_vip": " params ip=\"10.246.172.210\" nic=\"eth0\" cidr_netmask=\"24\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op
  monitor timeout=\"20s\" interval=\"10s\" depth=\"0\"",
...

The hacluster charm then creates the new resources *but* with a new name for res_aodh_eth0_vip:

$ sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-e310e9-1-lxd-0 (version 2.1.2-ada5c3b36e2) - partition with quorum
  * Last updated: Fri Sep 2 14:57:12 2022
  * Last change: Thu Sep 1 02:22:01 2022 by hacluster via crmd on juju-e310e9-1-lxd-0
  * 3 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-e310e9-0-lxd-0 juju-e310e9-1-lxd-0 juju-e310e9-2-lxd-0 ]

Full List of Resources:
  * Resource Group: grp_aodh_vips:
    * res_aodh_272179f_vip (ocf:heartbeat:IPaddr2): Started juju-e310e9-1-lxd-0
    * res_aodh_8486566_vip (ocf:heartbeat:IPaddr2): Started juju-e310e9-1-lxd-0
  * Clone Set: cl_res_aodh_haproxy [res_aodh_haproxy]:
    * Started: [ juju-e310e9-0-lxd-0 juju-e310e9-1-lxd-0 juju-e310e9-2-lxd-0 ]

So both vips are configured and working but the resource name has changed.

$ ping -c1 10.246.172.210
PING 10.246.172.210 (10.246.172.210) 56(84) bytes of data.
64 bytes from 10.246.172.210: icmp_seq=1 ttl=62 time=1.04 ms

--- 10.246.172.210 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.041/1.041/1.041/0.000 ms

$ ping -c1 10.246.168.210
PING 10.246.168.210 (10.246.168.210) 56(84) bytes of data.
64 bytes from 10.246.168.210: icmp_seq=1 ttl=63 time=0.613 ms

--- 10.246.168.210 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.613/0.613/0.613/0.000 ms

I think the problem is that the charm sets vip_iface to eth0 by default, and that causes this renaming. TBH I thought the vip_iface option had been removed long ago.
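
Checking that theory is straightforward; a sketch (vip_iface and the related vip_cidr are the legacy options in question):

$ juju config aodh vip_iface   # if this reports eth0, the res_aodh_eth0_vip request above is explained
$ juju config aodh vip_cidr    # likewise would explain cidr_netmask="24" in the resource params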

Liam Young (gnuoy) wrote:

On reflection I think comments 7 and 8 are correct and it's a juju bug. Looking into it showed that unit hacluster-aodh/1 was missing from the hanode relation with hacluster-aodh/0, but only from hacluster-aodh/0's point of view. Inspecting the relation from hacluster-aodh/1's point of view correctly shows both peer units (hacluster-aodh/0 and hacluster-aodh/2).

juju version: 2.9.33-ubuntu-amd64

$ juju run --application hacluster-aodh "relation-ids hanode"
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/1
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/2
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/0

$ juju run --application hacluster-aodh "relation-list -r hanode:8"
- Stdout: |
    hacluster-aodh/2
  UnitId: hacluster-aodh/0
- Stdout: |
    hacluster-aodh/0
    hacluster-aodh/2
  UnitId: hacluster-aodh/1
- Stdout: |
    hacluster-aodh/0
    hacluster-aodh/1
  UnitId: hacluster-aodh/2

$ juju status aodh
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud/default 2.9.33 unsupported 08:46:42Z

SAAS Status Store URL
grafana active foundations-maas admin/lma-maas.grafana
graylog active foundations-maas admin/lma-maas.graylog
nagios active foundations-maas admin/lma-maas.nagios
prometheus active foundations-maas admin/lma-maas.prometheus

App Version Status Scale Charm Channel Rev Exposed Message
aodh 14.0.0 active 3 aodh yoga/stable 77 no Unit is ready
aodh-mysql-router 8.0.30 active 3 mysql-router 8.0/stable 35 no Unit is ready
filebeat 6.8.23 active 3 filebeat candidate 38 no Filebeat ready.
hacluster-aodh waiting 3 hacluster edge 109 no Resource: res_aodh_272179f_vip not yet configured
logrotated active 3 logrotated candidate 7 no Unit is ready.
nrpe active 3 nrpe candidate 94 no Ready
prometheus-grok-exporter maintenance 3 prometheus-grok-exporter candidate 8 no Installing software
public-policy-routing active 3 advanced-routing candidate 11 no Unit is ready
telegraf active 3 telegraf candidate 54 no Monitoring ceph-osd/2 (source version/commit 76901fd)

Unit Workload Agent Machine Public address Ports Message
aodh/0* active idle 0/lxd/0 10.246.165.92 8042/tcp Unit is ready
  aodh-mysql-router/0* active idle 10.246.165.92 Unit is ready
  filebeat/30 active idle 10.246.165.92 Filebeat ready.
  hacluster-aodh/0* waiting idle 10.246.165.92 Resource: res_aodh_272179f_vip not yet configured
  logrotated/24 active idle 10.246.165.92 Unit is ready.
  nrpe/36 active idle 10.246.165.92 icmp,5666/tcp Ready
  prometheus-grok-exporter/31 active idle 10.246.165.92 9144/tcp Unit is ready
  public-policy-routing/13 active idle 10.246.165.92 Unit is ready
  telegraf/29 active idle 10.246.165.92 9103/tcp Monitoring aodh/0 (source version/commit 76901fd)
aodh/1 active idle 1/lxd/0 10.246.166.208 8042/tcp Unit is ready
  aodh-mysql-router/1 active idle 10.246.166.208 Unit is ready
  filebeat/31 active idle 10.246.166.208 Filebeat ready.
  hacluster-aodh/1 waiting idle 10.246.166.208 Resource: res_aodh_272179f_vip not yet configured
  logrotated/25 active idle 10.246.166.208 Unit is ready.
  nrpe/37 active idle 10.246.166.208 icmp,5666/tcp Ready
  prometheus-grok-exporter/30 active idle 10.246.166...


Ian Booth (wallyworld) wrote:

Ideally we'd get the following info:

juju dump-db
juju show-status-log -n 100 hacluster-aodh/N
juju show-status-log -n 100 aodh/N
juju status --relations --format yaml

We need the db dump to look at the applications, relations, units, relationscopes, unitstates collections. Maybe also settings.

Without the above it will be difficult to start trying to understand what's going on.

Bas de Bruijne (basdbruijne) wrote (last edit):

@Ian, the last few occurrences have almost all of this info.

Testrun https://solutions.qa.canonical.com/testruns/testRun/e896684f-db39-4f59-b6ff-2453fd45cf08 has a juju dump-db here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/generated/generated/openstack/juju-dump-db-openstack-2022-09-04-18.40.11.tar.gz

And juju crashdump here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/generated/generated/openstack/juju-crashdump-openstack-2022-09-04-18.40.11.tar.gz

The crashdump includes the show-status-log under the machine directory (hacluster-octavia/2 on 1/lxd/8 in this case). We do not collect `juju status --relations`, but we do have `juju_introspection/juju_engine_report-unit...`, which seems to include info about the relations.

An overview of all the collected logs and configs for this testrun can be found here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/index.html

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: cdo-qa
Jeffrey Chang (modern911) wrote:

SQA has had ~10 occurrences recently, and hacluster-manila-ganesha contributes most of them. We have removed manila-ganesha for now.

The crashdump link for one of them, https://oil-jenkins.canonical.com/artifacts/134a886d-1270-443f-9962-b9326879f887/generated/generated/openstack/juju-crashdump-openstack-2023-07-17-08.13.53.tar.gz

hacluster-manila-ganesha blocked 3 hacluster 2.0.3/stable 113 no Resource: res_ganesha_e817d5c_vip not running
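
For completeness, the workaround mentioned amounts to removing the application pair (a sketch, assuming SQA's application names):

$ juju remove-application hacluster-manila-ganesha
$ juju remove-application manila-ganesha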
