Attempts to write leadership settings when not the leader during relation-changed hooks

Bug #1654116 reported by Francis Ginther
This bug affects 2 people
Affects                                     Status         Importance  Assigned to      Milestone
Autopilot Log Analyser                      Fix Committed  Undecided   Unassigned
Canonical Juju                              Invalid        High        Unassigned
  2.1                                       Invalid        High        Unassigned
Charm Helpers                               Invalid        High        Unassigned
Landscape Server                            Fix Released   High        Francis Ginther
ceilometer (Juju Charms Collection)         Invalid        Undecided   Unassigned
juju (Ubuntu)                               Invalid        Undecided   Unassigned
rabbitmq-server (Juju Charms Collection)    Invalid        High        Unassigned

Bug Description

Observed this while deploying with landscape autopilot, version 16.11~bzr10796+jenkins3331-2.

The cloud being deployed was using ceph/ceph with 6 nodes and HA enabled.

Two of the three rabbitmq units failed with "error: cannot write leadership settings: cannot write settings: not the leader". The two tracebacks are:

[rabbitmq-server-0.log]
2017-01-04 22:22:57 INFO amqp-relation-changed Setting policy "HA" for pattern "^(?!amq\\.).*" to "{\"ha-mode\": \"all\", \"ha-sync-mode\": \"automatic\"}" with priority "0" ...
2017-01-04 22:25:38 INFO amqp-relation-changed error: cannot write leadership settings: cannot write settings: not the leader
2017-01-04 22:25:38 INFO amqp-relation-changed Traceback (most recent call last):
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amqp-relation-changed", line 706, in <module>
2017-01-04 22:25:38 INFO amqp-relation-changed hooks.execute(sys.argv)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 715, in execute
2017-01-04 22:25:38 INFO amqp-relation-changed self._hooks[hook_name]()
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amqp-relation-changed", line 217, in amqp_changed
2017-01-04 22:25:38 INFO amqp-relation-changed relation_settings=relation_settings)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 263, in peer_store_and_set
2017-01-04 22:25:38 INFO amqp-relation-changed relation_name=peer_relation_name)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 198, in peer_store
2017-01-04 22:25:38 INFO amqp-relation-changed relation_settings={key: value})
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 135, in relation_set
2017-01-04 22:25:38 INFO amqp-relation-changed return leader_set(settings=relation_settings, **kwargs)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 837, in inner_translate_exc2
2017-01-04 22:25:38 INFO amqp-relation-changed return f(*args, **kwargs)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 890, in leader_set
2017-01-04 22:25:38 INFO amqp-relation-changed subprocess.check_call(cmd)
2017-01-04 22:25:38 INFO amqp-relation-changed File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2017-01-04 22:25:38 INFO amqp-relation-changed raise CalledProcessError(retcode, cmd)
2017-01-04 22:25:38 INFO amqp-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'amqp:62_password=VGYqpSqts4R39S9rcJrSwrB7s9ygd2Xp8cnSwcxbTSRKwBjznhHy7fF6247CCRHC']' returned non-zero exit status 1
2017-01-04 22:25:38 ERROR juju.worker.uniter.operation runhook.go:107 hook "amqp-relation-changed" failed: exit status 1
2017-01-04 22:25:38 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet

[rabbitmq-server-1.log]
2017-01-04 22:13:54 INFO amqp-relation-changed error: cannot write leadership settings: cannot write settings: not the leader
2017-01-04 22:13:54 INFO amqp-relation-changed Traceback (most recent call last):
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/amqp-relation-changed", line 706, in <module>
2017-01-04 22:13:54 INFO amqp-relation-changed hooks.execute(sys.argv)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/core/hookenv.py", line 715, in execute
2017-01-04 22:13:54 INFO amqp-relation-changed self._hooks[hook_name]()
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/amqp-relation-changed", line 181, in amqp_changed
2017-01-04 22:13:54 INFO amqp-relation-changed admin=settings.get('admin', False))
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/amqp-relation-changed", line 122, in configure_amqp
2017-01-04 22:13:54 INFO amqp-relation-changed password = rabbit.get_rabbit_password(username)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/rabbit_utils.py", line 585, in get_rabbit_password
2017-01-04 22:13:54 INFO amqp-relation-changed peer_store(_key, _password)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 198, in peer_store
2017-01-04 22:13:54 INFO amqp-relation-changed relation_settings={key: value})
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 135, in relation_set
2017-01-04 22:13:54 INFO amqp-relation-changed return leader_set(settings=relation_settings, **kwargs)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/core/hookenv.py", line 837, in inner_translate_exc2
2017-01-04 22:13:54 INFO amqp-relation-changed return f(*args, **kwargs)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/charmhelpers/core/hookenv.py", line 890, in leader_set
2017-01-04 22:13:54 INFO amqp-relation-changed subprocess.check_call(cmd)
2017-01-04 22:13:54 INFO amqp-relation-changed File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2017-01-04 22:13:54 INFO amqp-relation-changed raise CalledProcessError(retcode, cmd)
2017-01-04 22:13:54 INFO amqp-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'ceilometer.passwd=4rcYrk2FfPNXFVgghdLtpC4VRCyBb4smXKFNHdwFxxdgsfqSrLy85WwW3MCCdPxM']' returned non-zero exit status 1
2017-01-04 22:13:54 ERROR juju.worker.uniter.operation runhook.go:107 hook "amqp-relation-changed" failed: exit status 1
2017-01-04 22:13:54 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet

The leader was rabbitmq-server/2. See the juju-status.yaml in the attached log tar.gz. All logs from the rabbitmq-server units are attached.

Revision history for this message
Francis Ginther (fginther) wrote :
Changed in landscape:
milestone: none → 16.12
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Please add a log analyser issue for this.

tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
David Ames (thedac) wrote :

This would appear to be a change in Juju behavior for leader-set in juju 2.1b3.

Any charm that uses peer storage will be affected by this.

The charm-helpers code that is affected [1] checks for NotImplementedError, which depends on leader_set [2] raising NotImplementedError, which in turn only translates OSError. It is very likely that something other than OSError is being raised.

We do not appear to be checking for leadership in our charms before attempting leader-set. This was an extension of the old peer storage that straddled the initial implementation of leadership in juju.

It may now be safe to assume leadership functionality is always available, which could clean some of this up, but we may also need to check is_leader before any leader-set command is run.

[1] http://bazaar.launchpad.net/~charm-helpers/charm-helpers/devel/view/head:/charmhelpers/contrib/peerstorage/__init__.py#L133
[2] http://bazaar.launchpad.net/~charm-helpers/charm-helpers/devel/view/head:/charmhelpers/core/hookenv.py#L893
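
For context, the translation described above works roughly like the sketch below (a simplified paraphrase of the linked charm-helpers code, not a verbatim copy): only OSError is translated into NotImplementedError, so a non-zero exit status from leader-set surfaces as subprocess.CalledProcessError and is never caught by the peerstorage fallback.

    import subprocess
    from functools import wraps

    def translate_exc(from_exc, to_exc):
        # Simplified version of the charm-helpers decorator: only from_exc
        # is converted; anything else propagates unchanged.
        def inner_translate_exc1(f):
            @wraps(f)
            def inner_translate_exc2(*args, **kwargs):
                try:
                    return f(*args, **kwargs)
                except from_exc:
                    raise to_exc()
            return inner_translate_exc2
        return inner_translate_exc1

    @translate_exc(from_exc=OSError, to_exc=NotImplementedError)
    def leader_set(settings=None, **kwargs):
        # If the leader-set binary is missing, check_call raises OSError and the
        # decorator turns it into NotImplementedError. If leader-set runs but
        # exits non-zero ("not the leader"), check_call raises
        # subprocess.CalledProcessError, which nothing here translates.
        settings = dict(settings or {}, **kwargs)
        cmd = ['leader-set'] + ['{}={}'.format(k, v or '') for k, v in settings.items()]
        subprocess.check_call(cmd)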

Revision history for this message
David Ames (thedac) wrote :

Juju 1.25.9 seems to behave the same. I can't find a .8 env to test against.

# leader-set poke=1 ; echo $?
error: cannot write leadership settings: cannot write settings: not the leader
1

If that is run from Python, we get the same subprocess.CalledProcessError, which is missed by the check for OSError.

The charms will need to gate on is_leader.
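
A minimal sketch of such a gate (a hypothetical illustration using charm-helpers' is_leader and leader_set, not the actual charm patch):

    from charmhelpers.core.hookenv import is_leader, leader_set, log

    def store_if_leader(key, value):
        # Only the Juju-elected leader may write leadership settings;
        # non-leaders skip the write instead of failing the hook.
        if is_leader():
            leader_set(settings={key: value})
            return True
        log('Not the leader; skipping leader_set for {}'.format(key))
        return False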

Changed in charm-helpers:
status: New → Triaged
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
Changed in charm-helpers:
importance: Undecided → High
Changed in rabbitmq-server (Juju Charms Collection):
milestone: none → 17.01
Revision history for this message
Francis Ginther (fginther) wrote :

Added ceilometer to the bug as I found it there too:

[from unit-ceilometer-0.log]
2017-01-05 21:17:54 INFO cluster-relation-joined error: cannot write leadership settings: cannot write settings: not the leader
2017-01-05 21:17:54 INFO cluster-relation-joined Traceback (most recent call last):
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/cluster-relation-joined", line 390, in <module>
2017-01-05 21:17:54 INFO cluster-relation-joined hooks.execute(sys.argv)
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/core/hookenv.py", line 715, in execute
2017-01-05 21:17:54 INFO cluster-relation-joined self._hooks[hook_name]()
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1817, in wrapped_f
2017-01-05 21:17:54 INFO cluster-relation-joined restart_functions)
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/core/host.py", line 524, in restart_on_change_helper
2017-01-05 21:17:54 INFO cluster-relation-joined r = lambda_f()
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1816, in <lambda>
2017-01-05 21:17:54 INFO cluster-relation-joined (lambda: f(*args, **kwargs)), restart_map, stopstart,
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/cluster-relation-joined", line 218, in cluster_joined
2017-01-05 21:17:54 INFO cluster-relation-joined peer_store('shared_secret', get_shared_secret())
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 198, in peer_store
2017-01-05 21:17:54 INFO cluster-relation-joined relation_settings={key: value})
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/contrib/peerstorage/__init__.py", line 135, in relation_set
2017-01-05 21:17:54 INFO cluster-relation-joined return leader_set(settings=relation_settings, **kwargs)
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/core/hookenv.py", line 837, in inner_translate_exc2
2017-01-05 21:17:54 INFO cluster-relation-joined return f(*args, **kwargs)
2017-01-05 21:17:54 INFO cluster-relation-joined File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/charmhelpers/core/hookenv.py", line 890, in leader_set
2017-01-05 21:17:54 INFO cluster-relation-joined subprocess.check_call(cmd)
2017-01-05 21:17:54 INFO cluster-relation-joined File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2017-01-05 21:17:54 INFO cluster-relation-joined raise CalledProcessError(retcode, cmd)
2017-01-05 21:17:54 INFO cluster-relation-joined subprocess.CalledProcessError: Command '['leader-set', 'shared_secret=b96c0d39-8766-42b5-88d9-69...


Revision history for this message
Francis Ginther (fginther) wrote :

Full log from previous comment.

Changed in autopilot-log-analyser:
status: New → Fix Committed
Revision history for this message
David Ames (thedac) wrote :

My original theory is wrong on both fronts.

First, using leader-set on a non-leader unit appears to have always exited with an error code, so this is not new.

Second, our charms (at least rabbitmq and ceilometer) do gate on leadership, so leader-set should never be executed if the unit is not the Juju-elected leader.

This would start to point the finger at is-leader giving a false positive. But I am still collecting science.

Changed in landscape:
status: New → Confirmed
importance: Undecided → High
Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
Revision history for this message
David Ames (thedac) wrote :

Has there been any change from juju 2.1beta3 to 2.1beta4?

I have yet to be able to reproduce this, and as stated in the previous comment, the charms do in fact gate on leadership before a leader-set command is executed.

Revision history for this message
Francis Ginther (fginther) wrote :

@thedac,

I have not seen this with beta4 yet after about two days' worth of runs. It was happening in about 1 in 5 runs, so maybe juju beta4 resolved this.

Revision history for this message
Francis Ginther (fginther) wrote :

@thedac,

Found a case of this with juju 2.1beta4 last night, this time with mysql. I've attached the logs from all of the mysql units.

Revision history for this message
David Ames (thedac) wrote :

This is a juju is-leader bug.

I have triple-checked that any call to leader-set is gated by an is-leader check in our charms, specifically in rabbitmq-server, percona-cluster and ceilometer.

With juju 2.1b3 and rabbitmq you can see that leadership is bouncing around between the three units. See the timestamps in the following:

rabbitmq-server-0/var/log/juju/unit-rabbitmq-server-0.log:2017-01-04 21:12:16 INFO juju-log Unknown hook leader-elected - skipping.
rabbitmq-server-0/var/log/juju/unit-rabbitmq-server-0.log:2017-01-04 21:47:22 INFO juju-log Unknown hook leader-elected - skipping.
rabbitmq-server-0/var/log/juju/unit-rabbitmq-server-0.log:2017-01-04 22:16:54 INFO juju-log Unknown hook leader-elected - skipping.
rabbitmq-server-0/var/log/juju/unit-rabbitmq-server-0.log:2017-01-04 22:25:38 INFO amqp-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'amqp:62_password=VGYqpSqts4R39S9rcJrSwrB7s9ygd2Xp8cnSwcxbTSRKwBjznhHy7fF6247CCRHC']' returned non-zero exit status 1

rabbitmq-server-1/var/log/juju/unit-rabbitmq-server-1.log:2017-01-04 22:01:25 INFO juju-log Unknown hook leader-elected - skipping.
rabbitmq-server-1/var/log/juju/unit-rabbitmq-server-1.log:2017-01-04 22:13:54 INFO amqp-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'ceilometer.passwd=4rcYrk2FfPNXFVgghdLtpC4VRCyBb4smXKFNHdwFxxdgsfqSrLy85WwW3MCCdPxM']' returned non-zero exit status 1

rabbitmq-server-2/var/log/juju/unit-rabbitmq-server-2.log:2017-01-04 21:39:21 INFO juju-log Unknown hook leader-elected - skipping.

With juju 2.1b4 and percona-cluster, unit 0 is the leader but some time goes by before it attempts leader-set. At the end, unit 2 takes over leadership.

mysql-0/var/log/juju/unit-mysql-0.log:2017-01-12 06:20:33 INFO juju-log Unknown hook leader-elected - skipping.
mysql-0/var/log/juju/unit-mysql-0.log:2017-01-12 06:35:01 DEBUG juju-log cluster:2: Leader unit - bootstrap required=True
mysql-0/var/log/juju/unit-mysql-0.log:2017-01-12 06:35:28 DEBUG juju-log cluster:2: Leader unit - bootstrap required=False
mysql-0/var/log/juju/unit-mysql-0.log:2017-01-12 06:50:55 INFO shared-db-relation-changed subprocess.CalledProcessError: Command '['leader-set', 'shared-db:54_access-network=']' returned non-zero exit status 1

mysql-2/var/log/juju/unit-mysql-2.log:2017-01-12 06:51:43 INFO juju-log Unknown hook leader-elected - skipping.

There are 4 possible problems as I see it:

1) is-leader is giving a false positive
2) is-leader is not in the PATH when is-leader is called in the charms
3) A race during leader election in which one or more units believe they are the leader
4) leader-set fails during a leader election

Changed in charm-helpers:
status: Triaged → Invalid
Changed in ceilometer (Juju Charms Collection):
status: New → Invalid
Changed in rabbitmq-server (Juju Charms Collection):
status: Triaged → Invalid
Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

@thedac, I wonder if bug #1657245 might be related. It also happened with the rabbit charm and looks like it's a split brain issue (/0 and /1 formed a cluster, /2 is the juju leader and went to lala land).

Ryan Beisner (1chb1n)
tags: added: uosci
Changed in landscape:
milestone: 16.12 → 17.01
Revision history for this message
James Page (james-page) wrote :

Switch juju-core task to juju.

affects: juju-core (Ubuntu) → juju (Ubuntu)
Revision history for this message
John A Meinel (jameinel) wrote :

The actual promise from Juju is that from the time you call 'is-leader' and get a True value, you will have 30s before we would possibly return True to any other unit.

Internally the mechanism is that we obtain a lease (valid for 1 minute) and attempt to renew that lease every 30s (so the very latest time you could call is-leader is with 31s left, if we failed to renew the leadership token).

Is it possible that there is a gap from when you call "is-leader" until you actually do "leader-set" that is longer than 30s? I'm wondering if you're running into load issues causing us to fail to renew the leadership during normal operation.

We do persist who has leadership in the database, which means it should be stable across API server restarts, etc. I haven't heard of a case where 'is-leader' can return true but leader-set would then immediately fail, but I can imagine that load could be a factor.
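
One way to test that theory from within a hook would be something like the sketch below (hypothetical instrumentation; the 30s figure comes from the guarantee described above, and the helper name is made up):

    import time
    from charmhelpers.core.hookenv import is_leader, leader_set, log

    def guarded_leader_set(settings):
        # Record when is-leader was checked, do the (possibly slow) hook work,
        # and warn if more than 30s elapsed before leader-set runs - at that
        # point the earlier True answer is no longer guaranteed to hold.
        checked_at = time.time()
        if not is_leader():
            return False
        # ... expensive hook work would happen here ...
        elapsed = time.time() - checked_at
        if elapsed > 30:
            log('{:.1f}s elapsed since is-leader check; leadership may have '
                'moved'.format(elapsed))
        leader_set(settings=settings)
        return True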

Changed in juju (Ubuntu):
assignee: nobody → Menno Smits (menno.smits)
Changed in juju:
milestone: none → 2.1-rc1
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Menno Smits (menno.smits)
Changed in juju (Ubuntu):
status: New → Invalid
assignee: Menno Smits (menno.smits) → nobody
Ian Booth (wallyworld)
Changed in juju:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Ian Booth (wallyworld) wrote :

Looking at the rabbitmq logs, it appears to me (I could be misinterpreting):

At 21:39:23 -> unit 2 becomes the leader (the leader-elected hook is run)
At 21:47:25 -> unit 0 becomes the leader (the leader-elected hook is run)
At 22:01:27 -> unit 1 becomes the leader (the leader-elected hook is run)
At 22:13:54 -> unit 1 thinks it is the leader and updates leader settings but Juju thinks it is not
At 22:17:06 -> unit 0 becomes the leader (the leader-elected hook is run)
At 22:22:57 -> unit 0 thinks it is the leader and updates leader settings but Juju thinks it is not

So it appears that Juju ran the leader-elected hook on both unit 0 and unit 1, and these units tried to do stuff as leader, but Juju thinks unit 2 is the leader.

Curiously, at 22:17:06 and 22:17:58, unit 2 successfully updates leader settings, even though it should not have been the leader anymore.

The charm doesn't appear to use the leader-elected or leader-deposed hooks - it just symlinks them to a common Python file. Juju still thinks the hooks exist, though, so it runs the leader-elected hook and logs that. But I can't see any logs of where Juju runs the leader-deposed hook. Perhaps that's a clue? Perhaps it indicates that Juju is getting confused as to who the leader is? It would be useful to have a mongo dump to see what state the model is in.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Ian Booth (wallyworld) wrote :

I'm thinking also that a controller log (at debug or even trace level) may be useful to see how the controller is handling leadership changes. Just the part of the log around the time of the issue and what leads up to it.

Revision history for this message
Ian Booth (wallyworld) wrote :

One possible root cause may be that, if the machine running the unit is loaded, the lease may have expired in the time between checking for leadership and the hook getting to write the leadership settings. It's very unusual for the unit leader to change as often as seen in the logs. This points to a possible load issue, because the leader should only change if the current leader did not get to update its lease before it expired, and failing to renew is likely due to load.

Revision history for this message
John A Meinel (jameinel) wrote :

I haven't dug particularly deeply. However if I do
 charm pull ceilometer
I get: cs:ceilometer-24
And then dig into the contents of:
 charmhelpers/contrib/peerstorage/__init__.py

I see that it has a function:
def leader_get():

which looks like it is supposed to be a compatibility function, so that old versions of the charm can just use peer relation settings to coordinate, and new versions of the charm can use leader-set and leader-get directly. However, if I'm reading the code correctly, it does:

  if attribute:
    ...
        # If attribute not present in leader db, check if this unit has set
        # the attribute in the peer relation
        if not leader_settings:
            peer_setting = _relation_get(attribute=attribute, unit=local_unit(),
                                         rid=rid)
            if peer_setting:
                leader_set(settings={attribute: peer_setting})
                leader_settings = peer_setting

Nothing in there does any sort of check for "is_leader". It is possible that the original call to "leader_get" itself is done inside an "is_leader" check, but this function appears to be trying to "look for settings that might have been set by a leader, but fall back to a peer relation, and if found in a peer relation, set them in the leader settings", which would only be safe to do *as* the leader.

Note also that peerstorage.__init__.py has a "relation_get" function which directly calls "leader_get" which (as noted) might try to call leader_set.

Similarly the function "peer_retrieve" tries to call relation_get, which may call leader_get which may ultimately call 'leader_set'.

There may be other issues, but the code as written appears to be unsafe to use.

It may be that a better writing would trap inside "leader_get" and only attempt to do a migration if "is_leader()" returned True?

Revision history for this message
John A Meinel (jameinel) wrote :

Sorry, I'm on crack, I missed line 70/71 which is exactly the 'check if I'm leader first':
    if not is_leader():
        return _leader_get(attribute=attribute)

Forgive my earlier rambling. I missed that line and then dug all over to see if it was trapped outside of that function.

Revision history for this message
John A Meinel (jameinel) wrote :

I'll note that 'peer_store' isn't safe to call directly, but looking at the traceback in the original description, it is line ~217 of hooks/amqp-relation-changed, which looks to be:
    # If this node is the elected leader then share our secret with other nodes
    if is_elected_leader('grp_ceilometer_vips'):
        peer_store('shared_secret', get_shared_secret())

And 'is_elected_leader' is just a call to 'is-leader' with a bunch of other stuff if 'is-leader' doesn't exist.

So I can't see anything functionally wrong with the code. It does, indeed, appear to be checking before it actually does the work.

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.1-rc1 → 2.1.0
Changed in juju:
importance: Critical → High
milestone: 2.1.0 → 2.2.0-alpha1
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Changed in juju:
milestone: 2.2.0-alpha1 → none
Revision history for this message
Stuart Bishop (stub) wrote :

You are going to want some better instrumentation. The leader-elected hook is not run when the unit becomes leader; it is run some time after the unit has become leader, and as far as I know there is no guarantee that the unit is still the leader when it runs. I'd stick an 'assert is_leader()' in the leader_set wrapper, possibly by monkey-patching charmhelpers.core.hookenv.leader_set()
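
A sketch of that monkey patch (hypothetical instrumentation, not shipped code; it assumes the charm imports hookenv from charm-helpers):

    import charmhelpers.core.hookenv as hookenv

    _original_leader_set = hookenv.leader_set

    def _asserting_leader_set(settings=None, **kwargs):
        # Fail loudly if Juju no longer considers this unit the leader at the
        # moment we try to write leadership settings.
        assert hookenv.is_leader(), 'leader_set called while not the leader'
        return _original_leader_set(settings=settings, **kwargs)

    hookenv.leader_set = _asserting_leader_set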

Revision history for this message
Francis Ginther (fginther) wrote :

I was finally able to reproduce this issue with logging set to TRACE. The full logs of the deployment (basically /var/log/* from every unit and the bootstrap node) are available from https://private-fileshare.canonical.com/~fginther/juju/lp-1654116/unit-not-leader-with-trace-logs.tar.gz (it's nearly 50GB and only available within canonical, sorry). The logs for the bootstrap unit are under "landscape-0-inner-logs/bootstrap/var/log".

I did have to use machines with a higher core count to reproduce this issue (12 cores instead of 4, with 16GB of RAM). This helps support the theory that the unit is overloaded. A 12-core system will spawn a lot more OpenStack API processes than a 4-core system due to the worker-multiplier option. This detail, combined with the fact that we are smooshing multiple services into the same physical machine with LXD containment, means that a physical machine will be running multiple dozens of additional processes in the same memory footprint.

Changed in juju:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking this as Invalid for Juju.
The problem occurs under heavy load and, as per comment #17, there is nothing Juju can do about it; it behaves as expected.

You may find comment #21 very helpful.

Changed in juju:
status: New → Invalid
Revision history for this message
Francis Ginther (fginther) wrote :

So, is this a problem with charm-helpers or the charms themselves? I'm kinda thinking that 'under load' is exactly the condition under which you want features like leadership to work. I'll ask @thedac to review the updates to this bug since his last comment and see if there are any next steps from the charms/charm-helpers end.

Maybe this is ultimately too difficult to fix, but I have a hard time accepting that this bug is invalid across the board.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Re: comment #21 and gating: if this is fully async, then gating is only going to tighten the race, not eliminate it, i.e. code like:

    if is_leader():
        leader_set(...)

can still error as the leadership could be lost between the two calls. The only safe way of doing leadership code is probably something like:

    if is_leader():
        try:
            # do things you have to do before calling leader_set()
            leader_set(...)  # if this passes, we are the leader
            # do other things that depend on being the leader
        except NotImplementedError:  # because that's what, bizarrely, charmhelpers raises
            # we're not the leader after all (or something else ... can't really tell)
            if not is_leader():
                pass  # we're not the leader any more; undo any damage here
            else:
                raise  # something else went wrong
        except OtherErrorsTheCodeMayHaveThrownIfWeAreInterested:
            pass  # etc.

Anyone spot any problems? If this is a pattern, we should try to get it into charmhelpers as a context manager or similar.
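
A rough sketch of what that could look like as a context manager (hypothetical, not in charm-helpers; the exception name and helper name are made up):

    import subprocess
    from contextlib import contextmanager
    from charmhelpers.core.hookenv import is_leader, leader_set, log

    class LostLeadership(Exception):
        """Leadership was lost between the is_leader() check and leader_set()."""

    @contextmanager
    def leader_actions():
        # Callers still gate on is_leader() before entering the block; this
        # just converts a failed leader-set inside the block into a dedicated
        # exception instead of an unhandled CalledProcessError crashing the hook.
        try:
            yield
        except (NotImplementedError, subprocess.CalledProcessError):
            if is_leader():
                raise  # something else went wrong; re-raise untouched
            raise LostLeadership('deposed while writing leadership settings')

    # Usage (hypothetical values):
    if is_leader():
        try:
            with leader_actions():
                leader_set(settings={'shared_secret': 'example'})
        except LostLeadership:
            log('Lost leadership mid-hook; leaving the write to the new leader')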

Revision history for this message
Chad Smith (chad.smith) wrote :

Setting the worker multiplier to 1.0 relieves memory pressure and may have resolved this issue.

Changed in landscape:
status: Confirmed → Incomplete
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.02 → 17.03
David Britton (dpb)
Changed in landscape:
status: Incomplete → Fix Committed
Changed in landscape:
status: Fix Committed → Fix Released