Containers are killed before any 'on stop/remove' handlers have a chance to run

Bug #1951415 reported by Jack Shaw
68
This bug affects 12 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Thomas Miller

Bug Description

When scaling down or removing units from a k8s charm, containers are killed almost immediately, leading to consistent errors and making it impossible to perform any cleanup actions on/with the container

This was found on juju 2.9.18 and micrk8s 1.21.6

This reliably leads to the following error on the charm container, which hangs around for longer:
```
unit-bar-0: 12:05:22 ERROR juju.worker.uniter pebble poll failed for container "bar": failed to get pebble info: cannot obtain system details: cannot communicate with server: Get "http://localhost/v1/system-info": dial unix /charm/containers/bar/pebble.socket: connect: no such file or directory
```

This can be replicated by deploying, then removing the following charm:
```
#!/usr/bin/env python3

import logging
import os
import tenacity

import ops.charm
import ops.framework
import ops.main
import ops.model
import ops.pebble

logger = logging.getLogger(__name__)

class FooCharm(ops.charm.CharmBase):

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.bar_pebble_ready, self._on_bar_pebble_ready)
        self.framework.observe(self.on.remove, self._on_remove)

    def _on_bar_pebble_ready(self, event):
        container = event.workload

        desired_layer = {
            'summary': 'layer',
            'description': '',
            'services': {
                "foo": {
                    'override': 'replace',
                    'summary': "foo",
                    'command': "sleep infinity",
                    'startup': 'enabled',
                    'environment': { }
                }
            }
        }
        container.add_layer("foo", desired_layer, combine=True)
        container.autostart()
        self.unit.status = ops.model.ActiveStatus()

    def _on_remove(self, _):
        logger.info("stopping service")
        container = self.unit.get_container('bar')
        container.stop("foo")

if __name__ == '__main__':
    ops.main.main(FooCharm)
```

And will result in the traceback:
```
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 58, in connect
    self.sock.connect(self.socket_path)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 729, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 72, in http_open
    return self.do_open(_UnixSocketConnection, req, socket_path=self.socket_path)
  File "/usr/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 52, in <module>
    ops.main.main(FooCharm)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/main.py", line 406, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/framework.py", line 278, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/framework.py", line 722, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/framework.py", line 767, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 48, in _on_remove
    container.stop("foo")
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/model.py", line 1050, in stop
    self._pebble.stop_services(service_names)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 813, in stop_services
    return self._services_action('stop', services, timeout, delay)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 826, in _services_action
    resp = self._request('POST', '/v1/services', body=body)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 700, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-bar-0/charm/venv/ops/pebble.py", line 742, in _request_raw
    raise ConnectionError(e.reason)
ops.pebble.ConnectionError: [Errno 2] No such file or directory
```

Jack Shaw (jack-shaw)
Changed in juju:
importance: Undecided → High
Harry Pidcock (hpidcock)
Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
status: New → In Progress
Revision history for this message
Harry Pidcock (hpidcock) wrote :
Changed in juju:
milestone: none → 2.9-next
Harry Pidcock (hpidcock)
Changed in juju:
milestone: 2.9-next → 2.9.32
Harry Pidcock (hpidcock)
Changed in juju:
milestone: 2.9.32 → 2.9.33
Changed in juju:
milestone: 2.9.33 → 2.9.34
Revision history for this message
Harry Pidcock (hpidcock) wrote :

Just a quick update on this bug, we think we've got some consensus on the approach to resolve this issue. There is a few moving parts to get right first and will require pebble, juju and likely operator framework work to support it. Will keep you all updated as this progresses, looking to dedicate an engineer to work on this soon.

Harry Pidcock (hpidcock)
Changed in juju:
assignee: Harry Pidcock (hpidcock) → Thomas Miller (tlmiller)
Changed in juju:
milestone: 2.9.34 → 2.9.35
Changed in juju:
milestone: 2.9.35 → 2.9.36
Changed in juju:
milestone: 2.9.36 → 2.9.37
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :
Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote :

And here are the logs for the COS layer.

Revision history for this message
Moises Emilio Benzan Mora (moisesbenzan) wrote (last edit ):

FWIW, on the Solutions QA team we found an instance of this bug. Here are all the logs for the K8s layer.

Changed in juju:
milestone: 2.9.37 → 2.9.38
Revision history for this message
Thomas Miller (tlmiller) wrote :

Hi All,

Just an update on this bug progress.

- We have released the first part of this fix which is allowing the Juju components to run under Pebble.
- Due to the change above we have found several bugs in Juju that caused us to do some significant work around our update policy of sidecar charms.
- We currently have the second lot of changes pending for this work to do with Pebble around making pebble send sigterm to workload processes correctly.
- As of today we have the next changes demonstrating that the hooks are called correctly within the charm on sigterm.

What remains is still a little more work around pebble to hold the sigterm sent by Kubernetes until the Juju container agent signals the hooks have run.

Please feel free to reach out if I can supply anymore information for the above.

Changed in juju:
milestone: 2.9.38 → 2.9.39
Changed in juju:
milestone: 2.9.39 → 2.9.40
Revision history for this message
Thomas Miller (tlmiller) wrote :

Pebble PR: https://github.com/canonical/pebble/pull/184

Juju PR to follow shortly.

Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Revision history for this message
Harry Pidcock (hpidcock) wrote :
Changed in juju:
milestone: 2.9.43 → 2.9.44
Jack Shaw (jack-shaw)
Changed in juju:
milestone: 2.9.44 → 2.9.43
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.