Bug #2065168 “All containers restarts after docker.service has been restarted” : Bugs : kolla-ansible

OpenStack Infra (hudson-openstack) wrote on 2024-05-08: Fix proposed to kolla-ansible (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/918639

Changed in kolla-ansible:
status:	New → In Progress

Sven Kieske (s-kieske) wrote on 2024-05-08:

#2

It's not clear to me from the description what the actual bug is supposed to be here. Could you elaborate on what "all container restarts with docker.service systemd service ignoring "live-restore"" means?

Do you mean containers are not restarted when the docker.service itself is restarted?

For reference, here is the documentation on what "Requires=" actually does:

https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#Requires=

Thanks!

Victor Chembaev (chembervint) wrote on 2024-05-08:

#3

Hi, Sven

Containers are restarted when docker.service itself is restarted, even if the live-restore option is set to true in daemon.json. This happens because of the Requires=docker.service statement in the kolla services' systemd unit files.

So it is a critical bug for production: any restart of docker.service will restart the whole OpenStack deployment.
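To check whether a given unit file carries the hard dependency described above, a minimal sketch along the following lines may help. It relies on the documented systemd behaviour that a unit with Requires= on another unit is stopped or restarted when that other unit is explicitly stopped or restarted. The function name and the example unit text are hypothetical and are not copied from the kolla-ansible template.

```python
def hard_dependencies(unit_text: str) -> list[str]:
    """Return the unit names listed in Requires= inside the [Unit] section."""
    section = None
    required: list[str] = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (systemd accepts '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Unit" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "Requires":
            # Requires= may appear several times and may list several units.
            required.extend(value.split())
    return required


# Hypothetical unit text, for illustration only.
EXAMPLE_UNIT = """\
[Unit]
Description=example container unit
After=docker.service
Requires=docker.service

[Service]
Restart=always
"""

if __name__ == "__main__":
    deps = hard_dependencies(EXAMPLE_UNIT)
    if "docker.service" in deps:
        print("unit will be restarted together with docker.service")
```

The sketch only reads unit text passed in as a string; it does not consider drop-in files, which can add further dependencies.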

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#4

I think Sven would like to say that this is not related to the systemd container control added recently. Even in the old/standard configuration, restarting docker.service would lead to all containers restarting. The live-restore option configures what to do with containers after a service restart: when live-restore is enabled, all the containers are started again.

Victor Chembaev (chembervint) wrote on 2024-05-08:

#5

Actually, not exactly - "You can configure the daemon so that containers remain running if the daemon becomes unavailable. This functionality is called live restore" (https://docs.docker.com/config/containers/live-restore/)

In the old/standard configuration (without systemd units for kolla containers), when I configured "live-restore": true in docker/daemon.json, all containers remained up and running while docker.service was restarting. Now, because of the "Requires=docker.service" statement in the unit files, all the systemd services are triggered to restart together with docker.service.

So now I can't control this behaviour. This is the subject of the bug, and it is really important for production deployments.
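For anyone who wants to confirm the daemon.json setting mentioned above before testing a restart, here is a small sketch. It only parses JSON text and checks the "live-restore" key; the function name is hypothetical, and the file location (commonly /etc/docker/daemon.json on Linux) is an assumption about the host.

```python
import json


def live_restore_enabled(daemon_json_text: str) -> bool:
    """Return True only if "live-restore" is explicitly set to true."""
    config = json.loads(daemon_json_text)
    # A missing key counts as disabled, matching Docker's default.
    return config.get("live-restore") is True


if __name__ == "__main__":
    # Assumed path; adjust for the host being checked.
    with open("/etc/docker/daemon.json", encoding="utf-8") as handle:
        print(live_restore_enabled(handle.read()))
```

This checks the file on disk only. Per the Docker documentation quoted in this thread, the daemon still has to be reloaded or restarted after the option is added before it takes effect.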

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#6

Actually, "daemon becomes unavailable" != "daemon restarted by service" ;)

A quote from the link you provided: "Restart the Docker daemon. On Linux, you can avoid a restart (and avoid any downtime for your containers) by reloading the Docker daemon. If you use systemd, then use the command systemctl reload docker. Otherwise, send a SIGHUP signal to the dockerd process."

So this is not the subject of the bug you described.

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#7

Just to be sure: https://paste.openstack.org/show/b8D9tp4XqW8jcAOgAnRu/

Victor Chembaev (chembervint) wrote on 2024-05-08:

#8

You just proved my point here (https://paste.openstack.org/show/b8D9tp4XqW8jcAOgAnRu/): "Up 3 seconds".
And if we remove "Requires=docker.service" and keep "live restore", we will be able to restart docker.service without affecting the containers.

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#9

Not proved: as you can see, these are not Kolla containers and there are no systemd units for the containers.

Victor Chembaev (chembervint) wrote on 2024-05-08:

#10

Do you have "live-restore": true in your docker/daemon.json on this host?

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#11

Sure, did you read https://paste.openstack.org/show/b8D9tp4XqW8jcAOgAnRu/ ? See lines 6 and 7.

Victor Chembaev (chembervint) wrote on 2024-05-08:

#12

https://paste.openstack.org/show/824069/

Here is an example.

Before 2023.1, on Zed and earlier deployments without systemd units for kolla containers, we used "live-restore": true, and it worked fine for us.

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#13

Let's rewind a little and describe this more precisely:

1. Without the systemd container control (https://review.opendev.org/c/openstack/kolla-ansible/+/816724) added in stable/2023.1, all containers are restarted by the command 'systemctl restart docker.service', even with "live-restore": true added to docker/daemon.json.

2. The documentation (https://docs.docker.com/config/containers/live-restore/) says:

2.1. "You can configure the daemon so that containers remain running if the daemon becomes unavailable."

and also says:

2.2. "Restart the Docker daemon. On Linux, you can avoid a restart (and avoid any downtime for your containers) by reloading the Docker daemon. If you use systemd, then use the command systemctl reload docker. Otherwise, send a SIGHUP signal to the dockerd process."

Please read 2.2 carefully. To safely restart the docker daemon you should use 'systemctl reload', not 'systemctl restart', which will restart all your containers. But in Kolla-Ansible with systemd container control [1] this behaviour has changed, so don't quote the docker documentation.

The behaviour of the container restart is now controlled by 'restart_policy' and 'docker_restart_policy'.
So maybe you should check your current deployment?

Victor Chembaev (chembervint) wrote on 2024-05-08:

#14

1. No, they stayed alive while docker.service was restarted.

2. The documentation says that you have to reload the docker daemon to enable the live-restore function after you put it into the daemon.json config. After that you can restart docker any way you like and all the containers will stay alive and well.

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#15

Okay, please tell me why my containers restarted (https://paste.openstack.org/show/b8D9tp4XqW8jcAOgAnRu/)?
This is a Ceph node for Xena, deployed without (https://review.opendev.org/c/openstack/kolla-ansible/+/816724). The docker/daemon.json contains "live-restore": true.

Victor Chembaev (chembervint) wrote on 2024-05-08:

#16

https://paste.openstack.org/show/bPTY20qDk2gzUvluEe2q/

Victor Chembaev (chembervint) wrote on 2024-05-08:

#17

https://paste.openstack.org/show/bi07YFwsjfifbgSvt9In/

As we can see, Ceph is also deployed with systemd units, which also include

Requires=docker.service

That is why your Ceph containers were restarted together with docker.service.

Maksim Malchuk (mmalchuk) wrote on 2024-05-08:

#18

Oh, really, Ceph containers with systemd units are a bad example. Sorry.
Anyway, to "reload the docker daemon to enable the live-restore function after you put it into the daemon.json config" you should run the 'systemctl daemon-reload' command.

Victor Chembaev (chembervint) wrote on 2024-05-09:

#19

Hi,

systemctl daemon-reload should be issued after any systemd unit file has been changed.

If you configure the service itself, for example docker (daemon.json is just a config file for the Docker daemon), you do not have to run systemctl daemon-reload. You only have to reload the service you have just configured, for example with systemctl reload docker.
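The distinction drawn above can be sketched as a small helper that picks the command to run after a file has been edited. This is an illustration only: the function name, the set of unit suffixes, and the example paths are assumptions, and the helper returns the command instead of executing it.

```python
from pathlib import PurePosixPath

# A few common systemd unit file suffixes (not an exhaustive list).
UNIT_SUFFIXES = {".service", ".socket", ".timer", ".target", ".mount"}


def command_after_edit(changed_file: str) -> list[str]:
    """Pick the command to run after editing the given file."""
    path = PurePosixPath(changed_file)
    if path.suffix in UNIT_SUFFIXES:
        # systemd has to re-read its unit definitions.
        return ["systemctl", "daemon-reload"]
    if path.name == "daemon.json":
        # Only the Docker daemon has to re-read its own configuration.
        return ["systemctl", "reload", "docker"]
    raise ValueError(f"no rule for {changed_file}")


if __name__ == "__main__":
    # Example paths are assumptions about a typical host layout.
    print(command_after_edit("/etc/docker/daemon.json"))
    print(command_after_edit("/etc/systemd/system/example.service"))
```

Returning the argument list keeps the sketch free of side effects; a caller could pass the result to subprocess.run if that is wanted.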

Maksim Malchuk (mmalchuk) wrote on 2024-05-16:

#20

Let's discuss this on IRC.

OpenStack Infra (hudson-openstack) wrote on 2024-06-27: Fix merged to kolla-ansible (master)

#21

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/918639
Committed: https://opendev.org/openstack/kolla-ansible/commit/c0b0dddc393e69087797831c808508a6358b3c56
Submitter: "Zuul (22348)"
Branch: master

commit c0b0dddc393e69087797831c808508a6358b3c56
Author: Victor Chembaev <email address hidden>
Date: Wed May 8 16:49:54 2024 +0300

Fix kolla-ansible systemd restart behaviour

Fix kolla systemd unit template to prevent restart
all kolla services with docker.service restart

Change-Id: I70dd1751dea6bfc9bb265aeda04b3392e135324c
Closes-Bug: 2065168

Changed in kolla-ansible:
status:	In Progress → Fix Released

Sven Kieske (s-kieske) wrote on 2024-06-27 (last edit on 2024-06-28):

#22

Did anybody read this part of the docker docs around this topic and somehow conclude it's not a problem?

to quote:

> Impact of live restore on running containers

> If the daemon is down for a long time, running containers may fill up the FIFO log the daemon normally reads. A full log blocks containers from logging more data. The default buffer size is 64K. If the buffers fill, you must restart the Docker daemon to flush them.

https://docs.docker.com/config/containers/live-restore/#impact-of-live-restore-on-running-containers

and also:

> Live restore allows you to keep containers running across Docker daemon updates, but is only supported when installing patch releases (YY.MM.x), not for major (YY.MM) daemon upgrades.

> If you skip releases during an upgrade, the daemon may not restore its connection to the containers. If the daemon can't restore the connection, it can't manage the running containers and you must stop them manually.

https://docs.docker.com/config/containers/live-restore/#live-restore-during-upgrades

So did someone check what we do when upgrading docker to major versions? Are we aware that we need to manually restart the containers now, and do we do this?

Did someone test that the issue with pipes filling up is not a problem for our deployment model?

In my experience, filled-up log pipes in the docker daemon are followed rather soon by filled-up RAM inside the container and subsequent crashes of the containers, or complete OOM situations on the host.

I hope someone can confirm that this is not a problem?

# Update with my comment from gerrit code review:

I had no knowledge that anybody was already using that in production! So if you have experience with it, I'm glad it works. It seems it's even enabled in our downstream as well, which I somehow missed (wrong grep, I guess).

So apologies for making a fuss.

Nevertheless, it might cause problems if the docker daemon is down for extended periods of time, e.g. when an upgrade of the docker daemon didn't go well for users of live-restore and the containers run for longer periods of time without being able to shuffle data over the docker pipe.

So it would've been nice if anybody had tested that prior to merging it.
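The upgrade rule quoted above from the Docker docs (live restore is supported for patch releases only) could be checked before an upgrade with something like the sketch below. The function name is hypothetical, and it assumes version strings with at least three dot-separated components, where the first two identify the release line.

```python
def live_restore_supported(old_version: str, new_version: str) -> bool:
    """True when both versions belong to the same release line (patch upgrade)."""
    old_parts = old_version.split(".")
    new_parts = new_version.split(".")
    if len(old_parts) < 3 or len(new_parts) < 3:
        raise ValueError("expected versions with at least three components")
    # Only the patch component (and anything after it) may differ.
    return old_parts[:2] == new_parts[:2]


if __name__ == "__main__":
    print(live_restore_supported("20.10.7", "20.10.24"))
    print(live_restore_supported("20.10.24", "23.0.0"))
```

When the check returns False, the quoted documentation implies the containers may need to be stopped and restarted manually around the daemon upgrade.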

OpenStack Infra (hudson-openstack) wrote on 2024-06-27: Fix proposed to kolla-ansible (stable/2024.1)

#23

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/922960

OpenStack Infra (hudson-openstack) wrote on 2024-06-28: Fix proposed to kolla-ansible (stable/2023.2)

#24

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/923020

OpenStack Infra (hudson-openstack) wrote on 2024-06-28: Fix proposed to kolla-ansible (stable/2023.1)

#25

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/923021

OpenStack Infra (hudson-openstack) wrote on 2024-07-03: Fix merged to kolla-ansible (stable/2024.1)

#26

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/922960
Committed: https://opendev.org/openstack/kolla-ansible/commit/b543ea4642620c14b737523b2e3677f7dc0557e7
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit b543ea4642620c14b737523b2e3677f7dc0557e7
Author: Victor Chembaev <email address hidden>
Date: Wed May 8 16:49:54 2024 +0300

Fix kolla-ansible systemd restart behaviour

Fix kolla systemd unit template to prevent restart
all kolla services with docker.service restart

    Change-Id: I70dd1751dea6bfc9bb265aeda04b3392e135324c
    Closes-Bug: 2065168
    (cherry picked from commit c0b0dddc393e69087797831c808508a6358b3c56)

OpenStack Infra (hudson-openstack) wrote on 2024-07-03: Fix merged to kolla-ansible (stable/2023.2)

#27

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/923020
Committed: https://opendev.org/openstack/kolla-ansible/commit/2241bb508096a859c20d08a08dd967369e5edb89
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 2241bb508096a859c20d08a08dd967369e5edb89
Author: Victor Chembaev <email address hidden>
Date: Wed May 8 16:49:54 2024 +0300

Fix kolla-ansible systemd restart behaviour

Fix kolla systemd unit template to prevent restart
all kolla services with docker.service restart

    Change-Id: I70dd1751dea6bfc9bb265aeda04b3392e135324c
    Closes-Bug: 2065168
    (cherry picked from commit c0b0dddc393e69087797831c808508a6358b3c56)

OpenStack Infra (hudson-openstack) wrote on 2024-07-03: Fix merged to kolla-ansible (stable/2023.1)

#28

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/923021
Committed: https://opendev.org/openstack/kolla-ansible/commit/d7c7576f64286d4c9b780a431b6c012f8407d764
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit d7c7576f64286d4c9b780a431b6c012f8407d764
Author: Victor Chembaev <email address hidden>
Date: Wed May 8 16:49:54 2024 +0300

Fix kolla-ansible systemd restart behaviour

Fix kolla systemd unit template to prevent restart
all kolla services with docker.service restart

    Change-Id: I70dd1751dea6bfc9bb265aeda04b3392e135324c
    Closes-Bug: 2065168
    (cherry picked from commit c0b0dddc393e69087797831c808508a6358b3c56)
