1/3 hacluster units is in "Resource: res_manila_share_manila_share not yet configured"

Bug #1890505 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects: OpenStack Manila-Ganesha Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

It looks like the manila-share resource has not been configured, although it did get exposed over the relation between manila-ganesha and hacluster.

Observations:

* 1 of 3 units of manila-ganesha did not expose the resources to its hacluster unit;
* based on the log, the leader hacluster unit has not even considered creating res_manila_share_manila_share:

    2020-08-05 17:42:52 DEBUG juju-log ha:42: Parsing cluster configuration using rid: ha:42, unit: manila-ganesha/0
...
    2020-08-05 17:41:27 DEBUG juju-log hanode:2: Configuring Resources: {}
    2020-08-05 17:43:07 DEBUG juju-log ha:42: Configuring Resources: {'res_ganesha_a3b980a_vip': 'ocf:heartbeat:IPaddr2'}
...
    2020-08-05 17:43:10 DEBUG ha-relation-changed ERROR: could not replace cib (rc=203)
    2020-08-05 17:43:10 DEBUG ha-relation-changed INFO: offending xml: <diff format="2">

juju status:

manila-ganesha/0* active idle 11 10.5.0.17 Unit is ready

  hacluster/0* active idle 10.5.0.17 Unit is ready and clustered
manila-ganesha/1 active idle 12 10.5.0.14 Unit is ready
  hacluster/2 active idle 10.5.0.14 Unit is ready and clustered
manila-ganesha/2 active idle 13 10.5.0.18 Unit is ready
  hacluster/1 waiting idle 10.5.0.18 Resource: res_manila_share_manila_share not yet configured

ubuntu@dmitriis-bastion:~$ juju ssh manila-ganesha/2

ubuntu@juju-dedf8c-zaza-a6cee85c0b7f-13:~$ sudo juju-run manila-ganesha/2 'relation-get -r 42 - hacluster/1'
clustered: "yes"
egress-subnets: 10.5.0.18/32
ingress-address: 10.5.0.18
private-address: 10.5.0.18

ubuntu@juju-dedf8c-zaza-a6cee85c0b7f-13:~$ sudo crm status
Stack: corosync
Current DC: juju-dedf8c-zaza-a6cee85c0b7f-11 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed Aug 5 19:36:35 2020
Last change: Wed Aug 5 17:43:08 2020 by root via cibadmin on juju-dedf8c-zaza-a6cee85c0b7f-11

3 nodes configured
1 resource configured

Online: [ juju-dedf8c-zaza-a6cee85c0b7f-11 juju-dedf8c-zaza-a6cee85c0b7f-12 juju-dedf8c-zaza-a6cee85c0b7f-13 ]

Full list of resources:

 res_ganesha_a3b980a_vip (ocf::heartbeat:IPaddr2): Started juju-dedf8c-zaza-a6cee85c0b7f-11

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/0 'relation-get -r 42 - manila-ganesha/0'
corosync_bindiface: eth0
corosync_mcastport: "4440"
egress-subnets: 10.5.0.17/32
ingress-address: 10.5.0.17
json_colocations: '{"ganesha_with_vip": "inf: res_nfs_ganesha_nfs_ganesha grp_ganesha_vips",
  "manila_with_vip": "inf: res_manila_share_manila_share grp_ganesha_vips"}'
json_delete_resources: '["res_ganesha_ens3_vip"]'
json_resource_params: '{"res_ganesha_a3b980a_vip": " params ip=\"10.5.0.45\" meta
  migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\""}'
json_resources: '{"res_ganesha_a3b980a_vip": "ocf:heartbeat:IPaddr2"}'
private-address: 10.5.0.17

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/1 'relation-get -r 42 - manila-ganesha/1'
egress-subnets: 10.5.0.14/32
ingress-address: 10.5.0.14
private-address: 10.5.0.14

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/2 'relation-get -r 42 - manila-ganesha/2'
corosync_bindiface: eth0
corosync_mcastport: "4440"
egress-subnets: 10.5.0.18/32
ingress-address: 10.5.0.18
json_colocations: '{"ganesha_with_vip": "inf: res_nfs_ganesha_nfs_ganesha grp_ganesha_vips",
  "manila_with_vip": "inf: res_manila_share_manila_share grp_ganesha_vips"}'
json_delete_resources: '["res_ganesha_ens3_vip"]'
json_groups: '{"grp_ganesha_vips": "res_ganesha_a3b980a_vip"}'
json_resource_params: '{"res_ganesha_a3b980a_vip": " params ip=\"10.5.0.45\" meta
  migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\"", "res_manila_share_manila_share": " meta migration-threshold=\"INFINITY\"
  failure-timeout=\"5s\" op monitor interval=\"5s\"", "res_nfs_ganesha_nfs_ganesha":
  " meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor interval=\"5s\""}'
json_resources: '{"res_ganesha_a3b980a_vip": "ocf:heartbeat:IPaddr2", "res_manila_share_manila_share":
  "systemd:manila-share", "res_nfs_ganesha_nfs_ganesha": "systemd:nfs-ganesha"}'
json_systemd_services: '["manila-share"]'
private-address: 10.5.0.18

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/0 'relation-get -r 42 - hacluster/0'
clustered: "yes"
egress-subnets: 10.5.0.17/32
ingress-address: 10.5.0.17
private-address: 10.5.0.17

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/1 'relation-get -r 42 - hacluster/2'
clustered: "yes"
egress-subnets: 10.5.0.14/32
ingress-address: 10.5.0.14
private-address: 10.5.0.14

ubuntu@dmitriis-bastion:~$ juju run --unit manila-ganesha/2 'relation-get -r 42 - hacluster/1'
clustered: "yes"
egress-subnets: 10.5.0.18/32
ingress-address: 10.5.0.18
private-address: 10.5.0.18

# cluster_connected got called at ~17:42:06
root@juju-dedf8c-zaza-a6cee85c0b7f-13:~# grep -iP 'invoking.*?cluster_connected' /var/log/juju/unit-manila-ganesha-2.log
2020-08-05 17:42:06 INFO juju-log ha:42: Invoking reactive handler: reactive/manila_ganesha.py:108:cluster_connected
2020-08-05 17:42:21 INFO juju-log ceph:14: Invoking reactive handler: reactive/manila_ganesha.py:108:cluster_connected
2020-08-05 17:42:59 INFO juju-log manila-plugin:18: Invoking reactive handler: reactive/manila_ganesha.py:108:cluster_connected

root@juju-dedf8c-zaza-a6cee85c0b7f-13:~# journalctl -u manila-share.service | grep Stopped | tail -n5
Aug 05 17:36:19 juju-dedf8c-zaza-a6cee85c0b7f-13 systemd[1]: Stopped OpenStack Manila Share.
Aug 05 17:36:21 juju-dedf8c-zaza-a6cee85c0b7f-13 systemd[1]: Stopped OpenStack Manila Share.
Aug 05 17:36:22 juju-dedf8c-zaza-a6cee85c0b7f-13 systemd[1]: Stopped OpenStack Manila Share.
Aug 05 17:42:02 juju-dedf8c-zaza-a6cee85c0b7f-13 systemd[1]: Stopped OpenStack Manila Share.
Aug 05 17:42:10 juju-dedf8c-zaza-a6cee85c0b7f-13 systemd[1]: Stopped OpenStack Manila Share.

# The first time ha.available was set for manila-ganesha/2
root@juju-dedf8c-zaza-a6cee85c0b7f-13:~# grep ha.available /var/log/juju/unit-manila-ganesha-2.log | head -n1
2020-08-05 17:43:33 DEBUG juju-log ha:42: tracer: set flag ha.available

ubuntu@dmitriis-bastion:~$ juju run --application hacluster 'grep "Configuring Resources" /var/log/juju/unit-hacluster-*.log'
- Stdout: |
    2020-08-05 17:39:37 DEBUG juju-log hanode:2: Configuring Resources: {}
    2020-08-05 17:40:14 DEBUG juju-log hanode:2: Configuring Resources: {}
    2020-08-05 17:40:50 DEBUG juju-log hanode:2: Configuring Resources: {}
    2020-08-05 17:41:27 DEBUG juju-log hanode:2: Configuring Resources: {}
    2020-08-05 17:43:07 DEBUG juju-log ha:42: Configuring Resources: {'res_ganesha_a3b980a_vip': 'ocf:heartbeat:IPaddr2'}
  UnitId: hacluster/0
- ReturnCode: 1
  Stdout: ""
  UnitId: hacluster/1
- ReturnCode: 1
  Stdout: ""
  UnitId: hacluster/2

Dmitrii Shcherbakov (dmitriis) wrote:

1) hacluster/0 has never lost leadership during its lifetime (leader-elected got executed once and only on that unit);

2) the hacluster interface is container-scoped; therefore, the fact that manila-ganesha/1 has not exposed its resource configuration does not matter;

3) the "Configuring Resources" log message appeared after the second ha-relation-changed got logged

* hacluster/0: 05 Aug 2020 17:37:47Z juju-unit executing running ha-relation-changed hook

* manila-ganesha/0:

2020-08-05 17:42:32 INFO juju-log ha:42: Invoking reactive handler: reactive/manila_ganesha.py:108:cluster_connected

* hacluster/0:
05 Aug 2020 17:42:51Z juju-unit executing running ha-relation-changed hook
2020-08-05 17:43:07 DEBUG juju-log ha:42: Configuring Resources: # ...

Chronologically, hacluster/0 learned about the resources exposed by manila-ganesha/0 at 05 Aug 2020 17:42:51Z. There weren't any -changed events associated with that relation after that.

juju show-status-log hacluster/0
Time Type Status Message
05 Aug 2020 17:36:14Z juju-unit executing running leader-elected hook
05 Aug 2020 17:36:20Z juju-unit executing running config-changed hook
05 Aug 2020 17:36:22Z workload maintenance Setting up corosync
05 Aug 2020 17:37:08Z juju-unit executing running start hook
05 Aug 2020 17:37:19Z workload active Unit is ready and clustered
05 Aug 2020 17:37:26Z juju-unit executing running ha-relation-joined hook
05 Aug 2020 17:37:47Z juju-unit executing running ha-relation-changed hook
05 Aug 2020 17:38:07Z juju-unit executing running hanode-relation-joined hook
05 Aug 2020 17:38:27Z juju-unit executing running hanode-relation-changed hook
05 Aug 2020 17:38:40Z workload blocked Insufficient peer units for ha cluster (require 3)
05 Aug 2020 17:38:48Z juju-unit executing running hanode-relation-joined hook
05 Aug 2020 17:39:08Z juju-unit executing running hanode-relation-changed hook
05 Aug 2020 17:41:39Z juju-unit idle
05 Aug 2020 17:42:51Z juju-unit executing running ha-relation-changed hook
05 Aug 2020 17:43:26Z juju-unit idle

juju show-status-log hacluster/1 --days 2
Time Type Status Message
05 Aug 2020 17:35:02Z workload waiting waiting for machine
05 Aug 2020 17:35:02Z juju-unit allocating
05 Aug 2020 17:35:02Z workload waiting installing agent
05 Aug 2020 17:35:06Z workload waiting agent initializing
05 Aug 2020 17:35:19Z workload maintenance installing charm software
05 Aug 2020 17:35:19Z juju-unit executing running install hook
05 Aug 2020 17:35:20Z workload maintenance Installin...


Dmitrii Shcherbakov (dmitriis) wrote:

The manila-share resource is only exposed by manila-ganesha/2, not by manila-ganesha/0, which is the unit the leader hacluster unit takes its resource configuration data from; manila-ganesha/0 only exposed a VIP resource:

sudo juju-run -r 42 --remote-unit manila-ganesha/0 hacluster/0 'JUJU_HOOK_NAME=ha-relation-changed hooks/ha-relation-changed'

ubuntu@juju-dedf8c-zaza-a6cee85c0b7f-11:~$ rlwrap telnet localhost 4444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(290)ha_relation_changed()
-> if not get_corosync_conf():

(Pdb) l
285 def ha_relation_changed():
286 import rpdb
287 rpdb.set_trace()
288 # Check that we are related to a principle and that
289 # it has already provided the required corosync configuration
290 -> if not get_corosync_conf():
291 log('Unable to configure corosync right now, deferring configuration',
292 level=INFO)
293 return
294
295 if relation_ids('hanode'):

(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(295)ha_relation_changed()
-> if relation_ids('hanode'):
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(296)ha_relation_changed()
-> log('Ready to form cluster - informing peers', level=DEBUG)
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(297)ha_relation_changed()
-> relation_set(relation_id=relation_ids('hanode')[0], ready=True)
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(305)ha_relation_changed()
-> if len(get_cluster_nodes()) < int(config('cluster_count')):
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(310)ha_relation_changed()
-> relids = relation_ids('ha') or relation_ids('juju-info')
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(311)ha_relation_changed()
-> if len(relids) == 1: # Should only ever be one of these
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(313)ha_relation_changed()
-> relid = relids[0]
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(314)ha_relation_changed()
-> units = related_units(relid)
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(315)ha_relation_changed()
-> if len(units) < 1:

(Pdb) l
310 relids = relation_ids('ha') or relation_ids('juju-info')
311 if len(relids) == 1: # Should only ever be one of these
312 # Obtain relation information
313 relid = relids[0]
314 units = related_units(relid)
315 -> if len(units) < 1:
316 log('No principle unit found, deferring configuration',
317 level=INFO)
318 return
319
320 unit = units[0]
(Pdb) relid
'ha:42'
(Pdb) units
['manila-ganesha/0']

(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-changed(320)ha_relation_changed()
-> unit = units[0]
(Pdb) n
> /var/lib/juju/agents/unit-hacluster-0/charm/hooks/ha-relation-chan...

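To make the failure mode concrete: the trace shows that ha-relation-changed picks a single principal unit (unit = units[0], here manila-ganesha/0) and reads the resource configuration from that unit alone. Below is a minimal, hedged sketch of that selection logic, not the real hook code; the helper is made up and the resource payloads are copied from the relation-get output earlier in this report:

    import json

    def ha_relation_changed_model(related_units, relation_settings):
        """Toy model of the selection seen in the pdb trace above: only the
        first related principal unit's data is ever consulted."""
        if len(related_units) < 1:
            print('No principle unit found, deferring configuration')
            return {}
        unit = related_units[0]                      # unit = units[0]
        settings = relation_settings.get(unit, {})
        resources = json.loads(settings.get('json_resources', '{}'))
        print('Configuring Resources:', resources)
        return resources

    # Payloads as observed above: manila-ganesha/0 published only the VIP,
    # while manila-ganesha/2 also published manila-share and nfs-ganesha.
    settings_by_unit = {
        'manila-ganesha/0': {
            'json_resources': '{"res_ganesha_a3b980a_vip": "ocf:heartbeat:IPaddr2"}',
        },
        'manila-ganesha/2': {
            'json_resources': ('{"res_ganesha_a3b980a_vip": "ocf:heartbeat:IPaddr2",'
                               ' "res_manila_share_manila_share": "systemd:manila-share",'
                               ' "res_nfs_ganesha_nfs_ganesha": "systemd:nfs-ganesha"}'),
        },
    }

    # The ha relation is container-scoped, so the leader hacluster unit only
    # sees its own principal, manila-ganesha/0 -- which is why
    # res_manila_share_manila_share never appears in "Configuring Resources".
    ha_relation_changed_model(['manila-ganesha/0'], settings_by_unit)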

Dmitrii Shcherbakov (dmitriis) wrote:

juju show-status-log manila-ganesha/0 --days 2
...
05 Aug 2020 17:42:19Z juju-unit executing running ha-relation-joined hook
05 Aug 2020 17:42:40Z juju-unit executing running ha-relation-changed hook

$ juju ssh manila-ganesha/0 'grep "cluster_connected" /var/log/juju/unit-manila-ganesha-0.log'
tracer: ++ queue handler reactive/manila_ganesha.py:108:cluster_connected
2020-08-05 17:42:32 INFO juju-log ha:42: Invoking reactive handler: reactive/manila_ganesha.py:108:cluster_connected

$ juju ssh manila-ganesha/0 'grep "juju-log ha:42: tracer: set flag ha.available" /var/log/juju/unit-manila-ganesha-0.log'
2020-08-05 17:42:42 DEBUG juju-log ha:42: tracer: set flag ha.available

Dmitrii Shcherbakov (dmitriis) wrote:

Two code paths set relation data on the HA relation.

1) cluster_connected -> this_charm.configure_ha_resources -> configure_ha_resources -> hacluster.bind_resources
2) cluster_connected -> hacluster.manage_resources(crm)

I thought that the second call with a custom CRM object (where crm.resources == {}) might be overriding the relation data, but it turns out not to be the case because of the "if v" here:

https://github.com/openstack/charm-interface-hacluster/blob/9ea447c296466ba9fdca1eb8e9752bbd6a75cc59/requires.py#L98
        relation_data = {
            'json_{}'.format(k): json.dumps(v, sort_keys=True)
            for k, v in crm.items() if v
        }
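
For illustration, the "if v" filter drops every key whose value is empty, so a CRM that only carries empty containers renders to an empty dict and nothing previously set on the relation is overwritten. A small standalone snippet (the example CRM dicts are made up; only the comprehension is taken from requires.py):

    import json

    def render_relation_data(crm):
        # Same comprehension as above: keys with falsy values are dropped.
        return {
            'json_{}'.format(k): json.dumps(v, sort_keys=True)
            for k, v in crm.items() if v
        }

    # An "empty" CRM renders to nothing at all, so manage_resources() called
    # with it cannot clobber data that was set on the relation earlier.
    print(render_relation_data({'resources': {}, 'delete_resources': []}))
    # -> {}

    # A populated CRM renders to the json_* keys seen in the relation data above.
    print(render_relation_data(
        {'resources': {'res_ganesha_a3b980a_vip': 'ocf:heartbeat:IPaddr2'}}))
    # -> {'json_resources': '{"res_ganesha_a3b980a_vip": "ocf:heartbeat:IPaddr2"}'}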

charms_openstack/charm/classes.py

    def configure_ha_resources(self, hacluster):
        """Inform the ha subordinate about each service it should manage. The
        child class specifies the services via self.ha_resources

        @param hacluster instance of interface class HAClusterRequires
        """
        RESOURCE_TYPES = {
            'vips': self._add_ha_vips_config,
            'haproxy': self._add_ha_haproxy_config,
            'dnsha': self._add_dnsha_config,
        }
        if self.ha_resources:
            for res_type in self.ha_resources:
                RESOURCE_TYPES[res_type](hacluster)
            hacluster.bind_resources(iface=self.config[IFACE_KEY]) # <-------- this

requires.py

    def manage_resources(self, crm): # <-------- this
        """
        Request for the hacluster to manage the resources defined in the
        crm object.

            res = CRM()
            res.primitive('res_neutron_haproxy', 'lsb:haproxy',
                          op='monitor interval="5s"')
            res.init_services('haproxy')
            res.clone('cl_nova_haproxy', 'res_neutron_haproxy')

            hacluster.manage_resources(crm)

        :param crm: CRM() instance - Config object for Pacemaker resources
        :returns: None
        """
        relation_data = {
            'json_{}'.format(k): json.dumps(v, sort_keys=True)
            for k, v in crm.items() if v
        }
        if data_changed('hacluster-manage_resources', relation_data):
            self.set_local(**relation_data)
            self.set_remote(**relation_data)

    def bind_resources(self, iface=None, mcastport=None):
        """Inform the ha subordinate about each service it should manage. The
        child class specifies the services via self.ha_resources

        :param iface: string - Network interface to bind to
        :param mcastport: int - Multicast port corosync should use for cluster
                                management traffic
        """
        if mcastport is None:
            mcastport = 4440
        resources_dict = self.get_local('resources')
        self.bind_on(iface=iface, mcastport=mcastport)
        if resources_dict:
            resources = relations.hacluster.common.CRM(**resources_dict) # <-------- this
            self.manage_resources(resources) # <-------- this

@reactive.when('ha.connected', 'ganesha-pool-configured',
        ...

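To trace the two write paths end to end, here is a toy model with simplified stand-ins for set_local/set_remote and charms.reactive's data_changed (not the real interface code): bind_resources() only republishes whatever resources were previously stored locally on that unit, and manage_resources() only writes to the relation when the rendered data changed, and never writes anything for an empty CRM because of the "if v" filter.

    import json

    class HAClusterModel:
        """Toy stand-in for HAClusterRequires, just to trace the data flow."""

        def __init__(self):
            self.local = {}        # storage behind get_local()/set_local()
            self.remote = {}       # what ends up on the ha relation
            self._seen = None      # crude stand-in for data_changed()

        def manage_resources(self, crm):
            relation_data = {
                'json_{}'.format(k): json.dumps(v, sort_keys=True)
                for k, v in crm.items() if v
            }
            if relation_data != self._seen:      # single shared change gate
                self._seen = relation_data
                self.local.update(relation_data)
                self.remote.update(relation_data)

        def bind_resources(self):
            resources_dict = self.local.get('resources')
            if resources_dict:                   # nothing stored -> nothing sent
                self.manage_resources({'resources': resources_dict})

    ha = HAClusterModel()
    # Path 1: the vips config step stores the VIP locally; bind_resources()
    # then publishes it, matching what manila-ganesha/0 exposed in this bug.
    ha.local['resources'] = {'res_ganesha_a3b980a_vip': 'ocf:heartbeat:IPaddr2'}
    ha.bind_resources()
    # Path 2: manage_resources() with an empty CRM renders to {} and therefore
    # leaves the data published above untouched.
    ha.manage_resources({'resources': {}})
    print(ha.remote)   # still only the json_resources entry for the VIP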

Dmitrii Shcherbakov (dmitriis) wrote:

https://review.opendev.org/#/c/743212 patch set 7 addresses the issue but there is a different problem blocking it.
