Constant leadership changes

Bug #1977798 reported by Haw Loeung
This bug affects 2 people
Affects          Status         Importance   Assigned to     Milestone
Canonical Juju   Fix Released   High         John A Meinel

Bug Description

Hi,

We're running Juju 2.9.27 and noticing a lot of leadership changes. We run cloud mirrors using the ubuntu-repository-cache charm. The leader unit syncs the metadata and then triggers its peers (the non-leader units) to sync from it, so that every unit in the same region serves exactly the same copy of the metadata.

We're seeing more and more leadership changes, and they cause the syncing to fail.

| https://pastebin.canonical.com/p/YXKFNRN52T/

The bit of code in the charm that logs leadership changes is in the link below:

| https://bazaar.launchpad.net/~ubuntu-repository-cache-charmers/ubuntu-repository-cache/layer-ubuntu-repository-cache/view/head:/reactive/ubuntu_repository_cache.py#L302

Tags: canonical-is


Haw Loeung (hloeung)
tags: added: canonical-is
description: updated
Barry Price (barryprice)
Changed in juju:
status: New → Confirmed
John A Meinel (jameinel) wrote :

So I investigated this by just creating a charm that has very simple properties:
class Ubuntu(charm.CharmBase):
    """The simplest of charms that just gets Ubuntu up and running.
    """

    def __init__(self, framework, *args):
        super().__init__(framework, *args)

...
        self.framework.observe(self.on.update_status, self._on_update_status)
...
        self.framework.observe(self.on.leader_settings_changed, self._on_settings_changed)
        self.framework.observe(self.on.leader_elected, self._on_elected)
...
    def _on_update_status(self, event):
        load1min, load5min, load15min = os.getloadavg()
        self.model.unit.status = model.ActiveStatus(
            'load: {:.2f} {:.2f} {:.2f}'.format(load1min, load5min, load15min))
        sub = subprocess.run('leader-get', capture_output=True)
        logging.info('update-status for %s, is-leader: %s:\nout: %r\nerr: %r',
                self.unit.name, self.unit.is_leader(), sub.stdout, sub.stderr)

...
    def _on_settings_changed(self, event):
        sub = subprocess.run('leader-get', capture_output=True)
        logging.info('leader-settings-changed for %s:\nout: %r\nerr: %r', self.unit.name, sub.stdout, sub.stderr)

    def _on_elected(self, event):
        sub = subprocess.run('leader-get', capture_output=True)
        logging.info('leader-elected for %s:\nout: %r\nerr: %r', self.unit.name, sub.stdout, sub.stderr)

I did see a leader-settings-changed event after upgrade-charm completed (which doesn't seem correct, but isn't strictly wrong, nor should it be happening often).

I also tested running `juju run --unit X -- leader-set leader_id=SAME_VALUE` and wasn't able to trigger a second leader-settings-changed event (so setting a field to the value it already has *shouldn't* trigger a leader-settings-changed event).

I also did the same thing as the linked charm, which sets leader_id in both update-status and leader-elected. In that case I still don't see lots of calls to leader-settings-changed (this is with a local test controller running a pre-2.9.30 Juju).
I'll try running it with 2.9.27, but I don't think we changed anything in this area.

John A Meinel (jameinel) wrote :

One other thing I tried: I stopped both unit agents so that they couldn't refresh or claim a lease, and then started the agent that had previously been the leader first. In my case it was:

juju ssh 0 juju_stop_unit ubuntu-operator/0
juju ssh 1 juju_stop_unit ubuntu-operator/1
sleep 60
juju ssh 1 juju_start_unit ubuntu-operator/1
sleep 2
juju ssh 0 juju_start_unit ubuntu-operator/0

And I *did* see that it got elected a second time, and that this triggered leader-settings-changed even though it was the same leader.
So it sounds like if there were connectivity issues, or some other reason why unit agents couldn't renew their leases, we would see the leader-settings-changed events that you have.
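
To make the reasoning above concrete, here is a toy model of a lease that must be renewed before it expires. It is only a sketch: the `Lease` class, the `claim` function, and the 60-second duration are assumptions for illustration and are not Juju's lease implementation. It shows how the same unit can be "elected" again once its lease has lapsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Hypothetical leadership lease: who holds it and until when."""
    holder: Optional[str] = None
    expires_at: float = 0.0


def claim(lease: Lease, unit: str, now: float, duration: float = 60.0) -> str:
    """Return 'renewed', 'elected', or 'denied' for a claim made at time `now`."""
    if lease.holder is not None and now < lease.expires_at:
        if lease.holder == unit:
            lease.expires_at = now + duration  # renewal: no new election
            return "renewed"
        return "denied"  # another unit still holds an unexpired lease
    # The lease has expired (or was never held): this is a fresh election,
    # even if the claimant is the unit that held the lease before.
    lease.holder = unit
    lease.expires_at = now + duration
    return "elected"


lease = Lease()
events = [
    claim(lease, "ubuntu-operator/1", now=0),    # first election
    claim(lease, "ubuntu-operator/1", now=30),   # renewed in time
    claim(lease, "ubuntu-operator/1", now=120),  # agent was stopped; lease lapsed
    claim(lease, "ubuntu-operator/0", now=122),  # other unit starts 2s later
]
print(events)  # ['elected', 'renewed', 'elected', 'denied']
```

In this model the third claim is a new election by the same unit, which would match a second leader-elected (and leader-settings-changed) without the leader actually changing.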

John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
importance: Undecided → High
status: Confirmed → In Progress
milestone: none → 2.9-next
Haw Loeung (hloeung) wrote :

Any progress on this?

Maybe there could be some kind of grace period before leadership re-elections?

Also, could Juju remember who the last leader was and give that unit priority in leadership re-elections?
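
As a rough illustration of the two suggestions above, here is a minimal sketch. The function `pick_leader`, its parameters, and the 30-second grace value are all hypothetical; nothing in this thread says Juju works this way.

```python
from typing import Optional, Sequence


def pick_leader(claimants: Sequence[str], last_leader: Optional[str],
                lapsed_for: float, grace: float = 30.0) -> Optional[str]:
    """Choose a leader among the units currently claiming leadership.

    claimants:   units asking for leadership, in arrival order
    last_leader: unit that held leadership most recently, if any
    lapsed_for:  seconds since the previous lease expired
    grace:       how long to wait for the previous leader to come back
    """
    if not claimants:
        return None
    if last_leader is None:
        return claimants[0]  # first election: nobody to wait for
    if last_leader in claimants:
        return last_leader   # the previous leader gets priority
    if lapsed_for < grace:
        return None          # still in the grace period: keep waiting
    return claimants[0]      # grace period over: elect someone else


units = ["ubuntu-repository-cache/4", "ubuntu-repository-cache/1"]
back_in_time = pick_leader(units, "ubuntu-repository-cache/1", lapsed_for=5)
still_waiting = pick_leader(units[:1], "ubuntu-repository-cache/1", lapsed_for=5)
gave_up = pick_leader(units[:1], "ubuntu-repository-cache/1", lapsed_for=45)
print(back_in_time, still_waiting, gave_up)
```

With these rules a brief agent restart would leave leadership where it was, at the cost of having no leader for up to `grace` seconds when the previous leader really is gone.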

Haw Loeung (hloeung) wrote :

| ubuntu@ip-172-31-58-147:~$ grep -Ei 'leader_id|leader:' /var/log/juju/unit-ubuntu-repository-cache-4.log | tail -n 20
| 2022-07-12 20:37:21 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/9
| 2022-07-12 21:14:53 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 21:20:29 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 21:29:27 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 21:30:21 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 21:57:28 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/2
| 2022-07-12 22:15:41 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:26:30 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:31:09 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:50:18 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:51:21 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:52:00 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 22:52:31 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/9
| 2022-07-12 23:11:19 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/9
| 2022-07-12 23:11:59 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 23:18:17 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-12 23:45:42 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-12 23:54:43 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-13 00:04:30 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/5
| 2022-07-13 00:05:19 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-elected fired. This unit is the new leader: ubuntu-repository-cache/4

For this envi...
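
A log like the one above mixes real leader changes with repeated events for the same leader. The following sketch separates the two. The regular expression is an assumption based only on the line format shown here, and `leader_changes` is a hypothetical helper; the sample lines are copied from the log above.

```python
import re

# Matches the charm's log messages as they appear in this report.
LINE_RE = re.compile(
    r"^(?:\|\s*)?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"(?P<hook>leader-settings-changed|leader-elected) fired\. "
    r".*?: (?P<leader>\S+/\d+)\s*$"
)


def leader_changes(lines):
    """Return (all events, events where the leader differs from the previous one)."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m["ts"], m["leader"]))
    changes = [ev for i, ev in enumerate(events)
               if i == 0 or ev[1] != events[i - 1][1]]
    return events, changes


prefix = "INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: "
sample = [
    "| 2022-07-12 20:37:21 " + prefix + "ubuntu-repository-cache/9",
    "| 2022-07-12 21:14:53 " + prefix + "ubuntu-repository-cache/1",
    "| 2022-07-12 21:20:29 " + prefix + "ubuntu-repository-cache/1",
    "| 2022-07-12 21:57:28 " + prefix + "ubuntu-repository-cache/2",
]
events, changes = leader_changes(sample)
print(len(events), "events,", len(changes), "actual leader changes")
for ts, leader in changes:
    print(ts, leader)
```

Run against the full unit log, this would show how many of the leader-settings-changed events are re-elections of the same unit rather than a move to a different leader.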


Haw Loeung (hloeung) wrote :

Elsewhere, in another environment with HA Juju controllers, in a different public cloud (Azure; the other was AWS):

| 2022-07-12 21:52:52 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-settings-changed fired. New leader_id: ubuntu-repository-cache/1
| 2022-07-13 02:57:20 INFO unit.ubuntu-repository-cache/4.juju-log server.go:319 leader-elected fired. This unit is the new leader: ubuntu-repository-cache/4

Haw Loeung (hloeung) wrote :

LP:1984060 has been filed about agents dying and respawning, which causes leadership re-elections.

Ian Booth (wallyworld) wrote :

In Juju 3.2 we are changing how we manage leadership, which should make the current leader more stable. We don't plan to do any work on this in earlier versions of Juju.

Changed in juju:
status: In Progress → Fix Committed
milestone: 2.9-next → 3.2.0
Changed in juju:
status: Fix Committed → Fix Released