Nova scheduler is stopped after each reboot

Bug #1845244 reported by Radosław Piliszek
This bug affects 4 people
Affects         Status          Importance   Assigned to         Milestone
kolla-ansible   Fix Released    High         Radosław Piliszek
Rocky           Fix Committed   High         Radosław Piliszek
Stein           Fix Committed   High         Radosław Piliszek
Train           Fix Released    High         Radosław Piliszek

Bug Description

Each reboot results in the nova_scheduler container being down (properly stopped).

Simply deploy and reboot the node with nova_scheduler to reproduce it.

It does not seem to happen with docker restart.

Host: CentOS 7
Containers: CentOS 7 stein source

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

RestartPolicy is unless-stopped (correct)

There is seemingly nothing relevant in the logs of either docker or nova_scheduler (not running with debug enabled at the moment).

Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → High
Revision history for this message
Laurent Dumont (baconpackets) wrote :

I'm seeing the same behavior with Stein from source on Kolla 8.0.2

Nothing that seems of value in the container logs

2019-10-12 15:21:20.311 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 33 exited with status 0
2019-10-12 15:21:20.313 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 36 exited with status 0
2019-10-12 15:21:20.382 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 32 exited with status 0
2019-10-12 15:21:20.456 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 37 exited with status 0
2019-10-12 15:21:21.869 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 31 exited with status 0
2019-10-12 15:21:21.892 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 34 exited with status 0
2019-10-12 15:21:21.921 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 38 exited with status 0
2019-10-12 15:21:21.958 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 35 exited with status 0

## When I manually restarted the container - docker start xxxx

2019-10-12 15:32:04.835 6 INFO oslo_service.periodic_task [-] Skipping periodic task _discover_hosts_in_cells because its interval is negative
2019-10-12 15:32:05.371 6 INFO oslo_service.service [req-28f96a67-8ba8-4213-8963-73b52c4559c1 - - - - -] Starting 8 workers
2019-10-12 15:32:05.379 23 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.381 24 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.382 25 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.386 26 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.389 28 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.388 27 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.392 29 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.401 30 INFO nova.service [-] Starting scheduler node (version 19.0.2)

The container stdout shows the following. It is not repeated on restart, so it might be emitted during shutdown.

Exception TypeError: "'NoneType' object is not callable" in <bound method _SocketDuckForFd.__del__ of _SocketDuckForFd:6> ignored

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: New → Triaged
milestone: none → 9.0.0
assignee: nobody → Mark Goddard (mgoddard)
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

docker 19.03.2

Revision history for this message
Will Szumski (willjs) wrote :

Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

Revision history for this message
Mark Goddard (mgoddard) wrote :

I have a CentOS 7 host with docker 19.03.4, ubuntu/binary/master containers, and do not see this issue. Tried three times to reproduce. I'll try again with centos containers.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Reproduced on the same host using centos/source/master images.

nova==20.0.0 (Train GA)

Revision history for this message
Mark Goddard (mgoddard) wrote :

Also reproduced on the same host using ubuntu/source/master images, nova==20.0.0.

Revision history for this message
Mark Goddard (mgoddard) wrote :

> Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

It means containers can be stopped manually if necessary and Docker won't try to restart them.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Argh, now I can't reproduce on either centos or ubuntu containers :(

Revision history for this message
Mark Goddard (mgoddard) wrote :

Got it! It's caused by a Docker issue, and our use of SIGHUP in ansible/roles/nova/tasks/refresh_scheduler_cell_cache.yml.

You can reproduce the issue as follows:

docker kill --signal HUP nova_scheduler
systemctl restart docker
docker ps -a | grep nova_scheduler

Docker assumes that the signal will stop the container, and marks it as not being restartable. However, daemons commonly handle SIGHUP as a reload request, so it does not stop the process.

This is captured in Docker bug https://github.com/moby/moby/issues/11065. Their solution appears to be to use --stop-signal to define which signal should be used to stop the container. This does not appear to work in my testing however.
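
A minimal sketch for checking the symptom after the reproduction steps above, assuming the container is named nova_scheduler as in this report. The helper name container_state and the DOCKER override variable are illustrative only and not part of kolla-ansible; .State.Status and .HostConfig.RestartPolicy.Name are fields from the docker inspect output.

```shell
#!/bin/sh
# Hypothetical helper: print "<status> <restart policy>" for a container.
# DOCKER may be overridden (for example with a stub); it defaults to the
# docker CLI.
container_state() {
    "${DOCKER:-docker}" inspect \
        --format '{{.State.Status}} {{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Usage (needs access to a running Docker daemon):
#   container_state nova_scheduler
```

On an affected host this would be expected to report a non-running status together with the unless-stopped policy, which matches the observation that the container stays down even though its restart policy is correct.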

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/690982

Changed in kolla-ansible:
status: Triaged → In Progress
Revision history for this message
Will Szumski (willjs) wrote :

>> Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

> It means containers can be stopped manually if necessary and Docker won't try to restart them.

Isn't that what always does? From https://docs.docker.com/config/containers/start-containers-automatically/:

always: Always restart the container if it stops. If it is manually stopped, it is restarted only when Docker daemon restarts or the container itself is manually restarted. (See the second bullet listed in restart policy details)

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

@Will,
If we ever want to be able to keep containers stopped across restarts (think docker upgrades or whatever else we eventually coordinate), we need the current policy.

@Mark,
Yeah, that explains why I got it from reboots and not daemon restarts - there was a high chance that I had rerun k-a between reboots. Thanks for debugging that.
Regarding the fix - I don't think we should refrain from using SIGHUP there. We might want to keep the service working (this one or another one later). The bug is a bug, but we can just signal the process without Docker's intervention.
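
One way to signal the process without going through docker kill can be sketched as follows. This is an illustration and not necessarily the command used in the merged patch. It assumes that the image ships a kill binary and that PID 1 inside the container is either the service itself or an init process that forwards signals to it. The helper name hup_inside_container and the DOCKER override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: deliver SIGHUP to PID 1 inside the container by
# running kill within it, instead of using "docker kill --signal HUP".
# DOCKER may be overridden (for example with a stub); it defaults to the
# docker CLI.
hup_inside_container() {
    "${DOCKER:-docker}" exec "$1" kill -HUP 1
}

# Usage (needs access to a running Docker daemon):
#   hup_inside_container nova_scheduler
```

Because docker kill is not involved, Docker should have no reason to record the container as stopped, so the restart policy should still apply after the next daemon restart. This is worth confirming with the reproduction steps above.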

Changed in kolla-ansible:
assignee: Mark Goddard (mgoddard) → Radosław Piliszek (yoctozepto)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/690982
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=6bdf202658e08bb9f43ca50334587b05dc4bac03
Submitter: Zuul
Branch: master

commit 6bdf202658e08bb9f43ca50334587b05dc4bac03
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/692212

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/692213

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/rocky)

Reviewed: https://review.opendev.org/692213
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d4734437b1d828013384cb20475981def569d6af
Submitter: Zuul
Branch: stable/rocky

commit d4734437b1d828013384cb20475981def569d6af
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244
    (cherry picked from commit 6bdf202658e08bb9f43ca50334587b05dc4bac03)
    (cherry picked from commit 2242fceb73abac54ef46acb41997bf81a77692a3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/692212
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=2242fceb73abac54ef46acb41997bf81a77692a3
Submitter: Zuul
Branch: stable/stein

commit 2242fceb73abac54ef46acb41997bf81a77692a3
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244
    (cherry picked from commit 6bdf202658e08bb9f43ca50334587b05dc4bac03)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 9.0.0.0rc1 release candidate.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
milestone: 9.0.0 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 7.2.0

This issue was fixed in the openstack/kolla-ansible 7.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.1.0

This issue was fixed in the openstack/kolla-ansible 8.1.0 release.
