private-address not refreshed in relation-data after binding change

Bug #1961448 reported by Rodrigo Barbieri
This bug affects 2 people
Affects                      Status         Importance   Assigned to        Milestone
Canonical Juju               Incomplete     High         Joseph Phillips
OpenStack HA Cluster Charm   Fix Released   Undecided    Rodrigo Barbieri

Bug Description

Juju version used: 2.9.12

On a fully functional deployment, hacluster has the correct binding for the hanode endpoint, so it matches the IP assigned to the unit. Changing the binding to an incorrect one (by running juju bind hacluster <wrong_binding> --force) causes network-get to fail, as expected, and the hanode-relation-changed hook then fails as well. Because network-get fails under the incorrect binding, the private-address property disappears from the relation-data, and the IP cannot be written to the ring0_addr properties in corosync.conf.

Setting the binding back to the correct one (through juju bind hacluster <correct_binding>) restores network-get functionality, but it does not restore the missing private-address property in the relation-data. The hanode-relation-changed hook failure therefore persists, and ring0_addr still cannot be written to corosync.conf because the private-address property is not found in the relation-data.

How can I force a refresh of the relation-data so that it re-reads the parameters from network-get?

As I understand it, the properties private-address, ingress-address and egress-subnets are "essential" properties that are present on every endpoint, as long as the network-get command is successful.
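
A quick way to confirm which of these properties have gone missing is to compare a unit's relation settings against the three expected keys. The sketch below is only an illustration: it assumes the relation settings have already been loaded into a plain dict (for example from relation-get output), and the helper name missing_network_keys is hypothetical, not part of Juju or charm-helpers.

```python
# Keys the report describes as present on every endpoint when network-get succeeds.
NETWORK_KEYS = ("private-address", "ingress-address", "egress-subnets")


def missing_network_keys(settings):
    """Return the network keys that are absent or empty in a unit's relation settings.

    `settings` is assumed to be a plain dict of relation data for one unit.
    """
    return [key for key in NETWORK_KEYS if not settings.get(key)]


# Hypothetical relation settings for a unit whose binding broke network-get.
broken = {"egress-subnets": "10.0.0.5/32", "ready": "True"}
print(missing_network_keys(broken))  # ['private-address', 'ingress-address']
```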

Is something blocking the relation-data from being refreshed, or from re-querying network-get, such as a hook error or a blocked state?

Things I have tried:

1) First I tried clearing the errors caused by the wrong binding change until the status was back to active/idle, before invoking "juju bind hacluster <correct_binding>". I used:

a) juju resolved --no-retry
b) writing ring0_addr values in corosync.conf manually

Still, changing the binding to the correct one resulted in errors due to the lack of private-address property.

2) With the correct binding now set, I then tried to refresh the property and overcome the errors in several ways:

a) juju resolved --no-retry
b) writing ring0_addr values in corosync.conf manually
c) setting the private-address properties manually through relation-set
d) restarting jujud
e) restarting the lxd container

None of those worked. Despite the property having been set manually, the code at [0] still read "None" from the private-address properties in the relation-data, as if they had not been set.

[0] https://github.com/juju/charm-helpers/blob/446cbfdad83e15b5cfd20f862d3c3b5b1956b998/charmhelpers/contrib/hahelpers/cluster.py#L187
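
The code at [0] is not reproduced here. Assuming it follows the usual pattern of reading one key per peer unit from the relation settings, the sketch below shows why a unit whose private-address is missing ends up mapped to None. It is a simplified stand-in, not the charm-helpers code: relation data is modelled as a nested dict, and collect_peer_ips is a hypothetical name.

```python
def collect_peer_ips(relation_data, addr_key="private-address"):
    """Map each peer unit to the address stored under `addr_key`.

    `relation_data` is an assumed stand-in for real relation settings:
    {unit_name: {key: value}}. A missing key yields None, which is the
    value the hook would then try to write as ring0_addr.
    """
    return {unit: settings.get(addr_key) for unit, settings in relation_data.items()}


# Hypothetical hanode relation contents after the bad binding change.
hanode = {
    "hacluster/0": {"private-address": "10.0.0.10"},
    "hacluster/1": {},  # private-address dropped from the relation data
}
print(collect_peer_ips(hanode))  # {'hacluster/0': '10.0.0.10', 'hacluster/1': None}
```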

description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Phillips (manadart)
milestone: none → 2.9.26
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Quick update: I repeated my tests, this time also doing a relation-set of the ingress-address and egress-subnets properties. The logs still showed "None" being read from the relation, so I kept running "juju resolved --no-retry" and eventually saw the properties being read successfully. A later juju config command that flipped the debug value broke it again, but it healed itself afterwards.

So right now the most consistent workaround seems to be to apply the properties manually through relation-set and repeat "juju resolved --no-retry" until it finally works. Still, a bugfix is needed to force network-get to be invoked and the properties to be updated.
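
The workaround described above can be written down as a small script. This is a sketch only: it assumes the Juju 2.9 client, where hook tools such as relation-set can be invoked through "juju run --unit", and the unit name, relation id and addresses are placeholders that must be taken from the actual deployment. The function only builds the command lines; executing them is left commented out.

```python
import shlex


def workaround_commands(unit, relation_id, address, egress_subnet):
    """Build the CLI calls for the manual workaround (nothing is executed here).

    All four arguments are placeholders to be taken from the real deployment.
    """
    settings = {
        "private-address": address,
        "ingress-address": address,
        "egress-subnets": egress_subnet,
    }
    pairs = " ".join(f"{key}={shlex.quote(value)}" for key, value in settings.items())
    hook_tool = f"relation-set -r {shlex.quote(relation_id)} {pairs}"
    return [
        # Re-apply the network properties by hand inside the unit's hook context.
        ["juju", "run", "--unit", unit, hook_tool],
        # Clear the hook error without re-running the failed hook; per the
        # report this may need repeating until the values are read back.
        ["juju", "resolved", "--no-retry", unit],
    ]


for cmd in workaround_commands("hacluster/0", "hanode:1", "10.0.0.10", "10.0.0.10/32"):
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to execute
```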

Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

<comment in wrong LP>

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Triaged
Changed in juju:
milestone: 2.9.26 → 2.9.27
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi @Joseph, could you please post the PR link here? Thanks in advance.

Changed in juju:
milestone: 2.9.27 → 2.9.28
tags: added: sts
Changed in juju:
status: Triaged → Incomplete
Joseph Phillips (manadart) wrote :

As we discussed, the logic exists to update network relation data upon rebind.

I tried to reproduce this and got the expected behaviour on MAAS.

Spaces:
https://pastebin.canonical.com/p/H2dVddFqnb/

I deployed mariadb bound to space-default, and related it to mediawiki. Relation data looked like this in the DB:
https://pastebin.canonical.com/p/XPqNzc5YxK/

I rebound mariadb:
https://pastebin.canonical.com/p/fqXBxRGqbX/

Relation data changed as expected:
https://pastebin.canonical.com/p/ShVHpWtRsd/

This behaviour is triggered by the agent itself in the config-changed event following rebind. This *could* be blocked if the charm was in an error state requiring resolution, but apart from that I'd need more to go on. The happy path appears to work as designed.

Changed in juju:
milestone: 2.9.28 → 2.9.29
description: updated
Joseph Phillips (manadart) wrote :

I ran the same steps as above with 2.9.12 and got the same result.

Changed in juju:
milestone: 2.9.29 → none
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)
Changed in charm-hacluster:
status: New → In Progress
Felipe Reyes (freyes)
Changed in charm-hacluster:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
milestone: none → 22.04
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/836887
Committed: https://opendev.org/openstack/charm-hacluster/commit/d54de3d3464352ca07e4b9d9f6a5c8350464b29b
Submitter: "Zuul (22348)"
Branch: master

commit d54de3d3464352ca07e4b9d9f6a5c8350464b29b
Author: Rodrigo Barbieri <email address hidden>
Date: Wed Apr 6 18:42:13 2022 -0300

    Prevent errors when private-address=None

    Whenever a peer returns None as its IP, it results in
    misconfiguration in corosync.conf, which results in
    a series of cascading hook errors that are difficult to
    sort out.

    More specifically, this usually happens when network-get
    does not work for the current binding. The main problem
    is that when changing bindings, a hook fires before the
    network-get data is updated. This hook fails and prevents
    the network-get from being re-read.

    This patch changes the code behavior to ignore None IP
    entries, therefore gracefully exiting and deferring further
    configuration due to insufficient number of peers when that
    happens, so that a later hook can successfully read the IP
    from the relation and set the IPs correctly in corosync.

    Closes-bug: #1961448
    Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
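
The approach described in this commit message can be illustrated as follows. This is a minimal sketch rather than the merged patch: usable_peer_ips, ready_to_configure and the min_peers threshold are hypothetical names chosen for the example, and the peer data is a plain dict of unit name to address.

```python
def usable_peer_ips(peer_ips):
    """Drop peers whose address is None so they never reach corosync.conf."""
    return {unit: ip for unit, ip in peer_ips.items() if ip is not None}


def ready_to_configure(peer_ips, min_peers):
    """Return True only when enough peers have published a usable address.

    When this returns False the hook should exit cleanly and let a later
    hook retry, instead of raising and leaving the unit in an error state.
    """
    return len(usable_peer_ips(peer_ips)) >= min_peers


# Hypothetical peer data: one unit has not published its address yet.
peers = {"hacluster/1": "10.0.0.11", "hacluster/2": None}
print(usable_peer_ips(peers))                  # {'hacluster/1': '10.0.0.11'}
print(ready_to_configure(peers, min_peers=2))  # False
```

Deferring instead of failing matters here because, as the commit message notes, a failed hook is what keeps the refreshed network-get data from being read.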

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (stable/focal)

Fix proposed to branch: stable/focal
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/841588

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (stable/focal)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/841588
Committed: https://opendev.org/openstack/charm-hacluster/commit/07b7e5e367bde8d15ea7a2c1b631038c73158217
Submitter: "Zuul (22348)"
Branch: stable/focal

commit 07b7e5e367bde8d15ea7a2c1b631038c73158217
Author: Rodrigo Barbieri <email address hidden>
Date: Wed Apr 6 18:42:13 2022 -0300

    Prevent errors when private-address=None

    Whenever a peer returns None as its IP, it results in
    misconfiguration in corosync.conf, which results in
    a series of cascading hook errors that are difficult to
    sort out.

    More specifically, this usually happens when network-get
    does not work for the current binding. The main problem
    is that when changing bindings, a hook fires before the
    network-get data is updated. This hook fails and prevents
    the network-get from being re-read.

    This patch changes the code behavior to ignore None IP
    entries, therefore gracefully exiting and deferring further
    configuration due to insufficient number of peers when that
    happens, so that a later hook can successfully read the IP
    from the relation and set the IPs correctly in corosync.

    Closes-bug: #1961448
    Change-Id: I5ed140a17e184fcf6954d0f66e25f74564bd281c
    (cherry picked from commit d54de3d3464352ca07e4b9d9f6a5c8350464b29b)

tags: added: in-stable-focal