IP address sometimes not set or incorrect on pebble_ready event

Bug #1929364 reported by Ben Hoyt
This bug affects 3 people
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        Harry Pidcock

Bug Description

Per the report from "sed-i" in https://github.com/canonical/operator/issues/538: sometimes, after deploying and then adding more units, the IP address is not yet available when the pebble_ready event arrives at the charm. Details from canonical/operator#538:

ENVIRONMENT

microk8s in multipass (4 CPUs, 8 GB RAM):

$ juju --version
2.9.0-ubuntu-amd64

$ microk8s kubectl version
Client Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-34+df7df22a741dbc", GitCommit:"df7df22a741dbc18dc3de3000b2393a1e3c32d36", GitTreeState:"clean", BuildDate:"2021-05-12T21:08:20Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}

DESCRIPTION

When I deploy and then add 3 units (alertmanager, https://github.com/sed-i/alertmanager-operator/tree/feature/pebblization, in my case), occasionally the IP address is not ready.
This does not happen if I manually (slowly) add the units one by one.

Expected: An IP address is ready (and correct) by the time pebble_ready fires.
Actual: Occasionally, when adding multiple units at once, the IP address is not available (or is stale) when queried from within pebble_ready.

REPRODUCIBLE SCENARIO

While adding 3 units

Very consistently, the following code assigns None to bind_address for 1-2 of the added units:

    relation = self.model.get_relation("replicas")
    bind_address = self.model.get_binding(relation).network.bind_address

Similarly, unit-get occasionally returns an empty string under the same circumstances:

    bind_address = check_output(["unit-get", "private-address"]).decode().strip()

When restarting the machine

I have 4 units running, and then I `sudo reboot` the machine. When the application (alertmanager) comes back up, bind_address returns the IP address from the previous boot.
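Not part of any fix, but a minimal charm-side sketch of a possible workaround while this is open (hypothetical; assumes the ops framework, a workload container named "alertmanager", and the "replicas" peer relation used above): defer the event whenever Juju has not reported a bind address yet, and retry on a later dispatch.

```python
# Hypothetical workaround sketch, not part of any Juju fix.
from ops.charm import CharmBase
from ops.main import main


class AlertmanagerCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # "alertmanager" is an assumed container name from metadata.yaml.
        self.framework.observe(
            self.on.alertmanager_pebble_ready, self._on_pebble_ready
        )

    def _on_pebble_ready(self, event):
        relation = self.model.get_relation("replicas")
        binding = self.model.get_binding(relation) if relation else None
        bind_address = binding.network.bind_address if binding else None
        if bind_address is None:
            # Address not yet available (this bug): retry on a later event.
            event.defer()
            return
        # ... configure the workload with str(bind_address) here ...


if __name__ == "__main__":
    main(AlertmanagerCharm)
```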

Jon Seager (jnsgruk)
tags: added: sidecar-charn
tags: added: sidecar-charm
removed: sidecar-charn
Revision history for this message
John A Meinel (jameinel) wrote :

It makes sense that pod-spec charms might not have a bind address at the time the charm code runs, since if they haven't declared a pod spec yet, there is no workload pod.
However, with sidecar charms, the pod is running both the charm container and the workload container, so by the time pebble_ready fires there should be a valid IP address for the pod. (I would actually expect us to have a valid IP by the time 'install' fires.)

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.9.4
Revision history for this message
Leon (sed-i) wrote :

Per suggestion of ~jnsgruk: `ip a` (via subprocess.check_output) shows a correct IP address while, at the same time, `bind_address` returns None.

Changed in juju:
milestone: 2.9.4 → 2.9.5
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

I suspect the root cause here is as per https://bugs.launchpad.net/bugs/1930649

There's a PR to fix the above bug
https://github.com/juju/juju/pull/13049

Hopefully retesting with the above fix will show it's solved.

Revision history for this message
Leon (sed-i) wrote :

In Juju 2.9.5 (maybe via [this](https://github.com/juju/juju/pull/13049) PR), some entries were automatically added to the peer data bag:
```
self.model.get_relation("replicas").data.keys() =
KeysView({
  <ops.model.Unit alertmanager-k8s/0>: {
    'egress-subnets': '10.152.183.222/32',
    'ingress-address': '10.152.183.222',
    'private-address': '10.152.183.222',
    'private_address': '10.1.157.125'}, # added by me
  <ops.model.Application alertmanager-k8s>: {}
})
```

Note the difference between:
1. `'private-address': '10.152.183.222'` - auto-populated - the application address
2. `'private_address': '10.1.157.125'` - populated manually by me - the unit address

**Shouldn't the auto-populated `private-address` be the unit address instead of the app address?**

```
Model      Controller  Cloud/Region        Version  SLA          Timestamp
dev-model  my-ctrlr    microk8s/localhost  2.9.5    unsupported  11:09:18-04:00

App               Version  Status  Scale  Charm             Store  Channel  Rev  OS          Address         Message
alertmanager-k8s           active  1      alertmanager-k8s  local           0    kubernetes  10.152.183.222

Unit                 Workload  Agent  Address       Ports  Message
alertmanager-k8s/0*  active    idle   10.1.157.125

Relation provider          Requirer                   Interface             Type  Message
alertmanager-k8s:replicas  alertmanager-k8s:replicas  alertmanager-replica  peer
```
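
For reference, a rough sketch of how a unit might publish its own pod address into the peer data bag under the manually chosen `private_address` key shown above (hypothetical helper; assumes the ops framework, the "replicas" peer relation, and the unit-get fallback from the bug description):

```python
import subprocess

# Hypothetical helper, written as a method on the charm class. It publishes
# this unit's own address into its peer relation data under 'private_address',
# since the auto-populated 'private-address' held the application address here.
def _publish_unit_address(self):
    relation = self.model.get_relation("replicas")
    if relation is None:
        return
    address = self.model.get_binding(relation).network.bind_address
    if address is None:
        # Fall back to the unit-get hook tool; it may also return "" (this bug).
        out = subprocess.check_output(
            ["unit-get", "private-address"]).decode().strip()
        address = out or None
    if address is not None:
        relation.data[self.unit]["private_address"] = str(address)
```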

Revision history for this message
Leon (sed-i) wrote (last edit ):

Also, bind_address occasionally returns None from within the "on_peer_joined" event handler. AFAIU, by the time "on_peer_joined" runs, an IP address should be guaranteed. Observed with Juju 2.9.5.

Changed in juju:
milestone: 2.9.5 → 2.9.6
Harry Pidcock (hpidcock)
Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
status: Triaged → In Progress
Revision history for this message
Harry Pidcock (hpidcock) wrote :
Changed in juju:
milestone: 2.9.6 → 2.9.7
Harry Pidcock (hpidcock)
Changed in juju:
status: In Progress → Fix Committed
milestone: 2.9.7 → 3.0.0
milestone: 3.0.0 → 2.9.7
status: Fix Committed → Fix Released
Harry Pidcock (hpidcock)
Changed in juju:
milestone: 2.9.7 → 2.9.6
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.6 → 2.9.7
status: Fix Released → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Leon (sed-i) wrote :

This is still an issue with Juju 2.9.21.

Revision history for this message
Ryan Barry (rbarry) wrote :

The issue present in Juju 2.9.21 seems to be that, during some sequence of events (@Leon-mintz can maybe provide logs), a sidecar charm unit ends up in a scenario where it has no address that can be used for the binding to a peer relation.

The pod *is* up and running, so there is an address, just not one which `network-get peer-relation` returns.

From our POV, a charm should *always* have binding addresses for the interfaces it provides, as long as the pod is up and charm code is running. That it may not is definitely a bug.
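
A small debugging sketch of the comparison described here (hypothetical; assumes it runs inside a hook context where Juju's network-get hook tool is on PATH, and that the peer binding is named "replicas" as above):

```python
import subprocess

# Hypothetical debug helper: print what network-get reports for the peer
# binding next to what the pod's own network stack shows, to spot the mismatch.
def dump_addresses():
    net_get = subprocess.check_output(
        ["network-get", "replicas", "--format=yaml"]).decode()
    ip_a = subprocess.check_output(["ip", "a"]).decode()
    print("network-get replicas:\n" + net_get)
    print("ip a:\n" + ip_a)
```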

Revision history for this message
Simon Aronsson (0x12b) wrote :

Yes, this is very much still an issue on 2.9.22 and 2.9.23.
