Bug #1845244 “Nova scheduler is stopped after each reboot” : Bugs : kolla-ansible

Radosław Piliszek (yoctozepto) wrote on 2019-09-24:

#1

RestartPolicy is unless-stopped (correct)

There is seemingly nothing relevant in the logs of either docker or nova_scheduler (not running with debug enabled atm).

Mark Goddard (mgoddard) on 2019-09-25

Changed in kolla-ansible:
importance:	Undecided → High

Laurent Dumont (baconpackets) wrote on 2019-10-12:

#2

I'm seeing the same behavior with Stein from source on Kolla 8.0.2

Nothing that seems of value in the container logs

2019-10-12 15:21:20.311 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 33 exited with status 0
2019-10-12 15:21:20.313 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 36 exited with status 0
2019-10-12 15:21:20.382 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 32 exited with status 0
2019-10-12 15:21:20.456 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 37 exited with status 0
2019-10-12 15:21:21.869 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 31 exited with status 0
2019-10-12 15:21:21.892 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 34 exited with status 0
2019-10-12 15:21:21.921 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 38 exited with status 0
2019-10-12 15:21:21.958 6 INFO oslo_service.service [req-4993422c-8481-49bf-ac05-f3b51f89d734 - - - - -] Child 35 exited with status 0

## When I manually restarted the container - docker start xxxx

2019-10-12 15:32:04.835 6 INFO oslo_service.periodic_task [-] Skipping periodic task _discover_hosts_in_cells because its interval is negative
2019-10-12 15:32:05.371 6 INFO oslo_service.service [req-28f96a67-8ba8-4213-8963-73b52c4559c1 - - - - -] Starting 8 workers
2019-10-12 15:32:05.379 23 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.381 24 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.382 25 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.386 26 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.389 28 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.388 27 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.392 29 INFO nova.service [-] Starting scheduler node (version 19.0.2)
2019-10-12 15:32:05.401 30 INFO nova.service [-] Starting scheduler node (version 19.0.2)

The container stdout also shows the following. It is not repeated on restart, and it might be printed during shutdown.

Exception TypeError: "'NoneType' object is not callable" in <bound method _SocketDuckForFd.__del__ of _SocketDuckForFd:6> ignored

Mark Goddard (mgoddard) on 2019-10-23

Changed in kolla-ansible:
status:	New → Triaged
milestone:	none → 9.0.0
assignee:	nobody → Mark Goddard (mgoddard)

Radosław Piliszek (yoctozepto) wrote on 2019-10-23:

#3

docker 19.03.2

Will Szumski (willjs) wrote on 2019-10-23:

#4

Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

Mark Goddard (mgoddard) wrote on 2019-10-23:

#5

I have a CentOS 7 host with docker 19.03.4, ubuntu/binary/master containers, and do not see this issue. Tried three times to reproduce. I'll try again with centos containers.

Mark Goddard (mgoddard) wrote on 2019-10-24:

#6

Reproduced on the same host using centos/source/master images.

nova==20.0.0 (Train GA)

Mark Goddard (mgoddard) wrote on 2019-10-24:

#7

Also reproduced on the same host using ubuntu/source/master images, nova==20.0.0.

Mark Goddard (mgoddard) wrote on 2019-10-24:

#8

> Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

It means containers can be stopped manually if necessary and Docker won't try to restart them.

Mark Goddard (mgoddard) wrote on 2019-10-24:

#9

Argh, now I can't reproduce on either centos or ubuntu containers :(

Mark Goddard (mgoddard) wrote on 2019-10-24:

#10

Got it! It's caused by a Docker issue, and our use of SIGHUP in ansible/roles/nova/tasks/refresh_scheduler_cell_cache.yml.

You can reproduce the issue as follows:

docker kill --signal HUP nova_scheduler
systemctl restart docker
docker ps -a | grep nova_scheduler

Docker assumes that the signal will stop the container and marks it as not restartable. However, SIGHUP does not stop a service that handles it, which is the case here since we send it to trigger a refresh.

This is captured in Docker bug https://github.com/moby/moby/issues/11065. Their solution appears to be to use --stop-signal to define which signal should be used to stop the container. However, this does not appear to work in my testing.
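
For anyone repeating the steps above, a small check like the following may help confirm the symptom after the daemon restart. It is only a sketch: the function names are made up for illustration, and it assumes the docker CLI is on PATH and that docker inspect exposes .HostConfig.RestartPolicy.Name and .State.Status for the container.

```shell
# Hypothetical helpers for checking a container after "systemctl restart docker".

# Prints "<restart policy> <status>", e.g. "unless-stopped exited".
container_restart_summary() {
    docker inspect \
        --format '{{.HostConfig.RestartPolicy.Name}} {{.State.Status}}' \
        "$1"
}

# Prints a one-line report and returns non-zero if the container is not running.
check_container_running() {
    summary="$(container_restart_summary "$1")" || return 1
    policy="${summary%% *}"   # text before the first space
    status="${summary##* }"   # text after the last space
    echo "$1: policy=$policy status=$status"
    [ "$status" = "running" ]
}
```

Running check_container_running nova_scheduler after the restart should print the policy and status on one line; a non-zero return with policy=unless-stopped would match the behaviour reported in this bug.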

OpenStack Infra (hudson-openstack) wrote on 2019-10-24: Fix proposed to kolla-ansible (master)

#11

Fix proposed to branch: master
Review: https://review.opendev.org/690982

Changed in kolla-ansible:
status:	Triaged → In Progress

Will Szumski (willjs) wrote on 2019-10-24:

#12

>> Why don't we use a restart policy of always by the way? Is it because we manually send containers signals for a clean shutdown?

> It means containers can be stopped manually if necessary and Docker won't try to restart them.

Isn't that what always does? From https://docs.docker.com/config/containers/start-containers-automatically/:

always: Always restart the container if it stops. If it is manually stopped, it is restarted only when Docker daemon restarts or the container itself is manually restarted. (See the second bullet listed in restart policy details)
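
To make the difference between the two policies under discussion concrete, here is a rough model of what happens to a container across a Docker daemon restart. It is a sketch based on my reading of the Docker documentation, not on Docker's code; the function name is hypothetical, and it deliberately covers only always and unless-stopped.

```shell
# Hypothetical model of the documented behaviour, limited to two policies.
# $1 = restart policy name
# $2 = "yes" if the container was stopped before the daemon restarted, else "no"
# Prints the expected state after the daemon restart: "started" or "stopped".
state_after_daemon_restart() {
    policy="$1"
    was_stopped="$2"
    case "$policy" in
        always)
            # Restarted with the daemon even if it had been stopped manually.
            echo started ;;
        unless-stopped)
            # A container that was stopped stays stopped.
            if [ "$was_stopped" = "yes" ]; then
                echo stopped
            else
                echo started
            fi ;;
        *)
            echo "unsupported policy: $policy" >&2
            return 2 ;;
    esac
}
```

For example, state_after_daemon_restart always yes prints started, while state_after_daemon_restart unless-stopped yes prints stopped.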

Radosław Piliszek (yoctozepto) wrote on 2019-10-25:

#13

@Will,
If we ever want to be able to keep containers stopped across restarts (think docker upgrades or whatever else we eventually coordinate), we need the current policy.

@Mark,
Yeah, that explains why I got it from reboots and not daemon restarts - there was a high chance that I had rerun k-a between reboots. Thanks for debugging that.
Regarding the fix - I don't think we should refrain from using SIGHUP there. We might want to keep the service running (this one or another later). The Docker bug is a bug, but we can just signal the process directly, without Docker's involvement.
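
One way to signal the process without going through docker kill could look like the sketch below. It is not necessarily what the proposed patch does: the function name is made up, and it assumes the docker CLI is available, that docker inspect reports the host PID of the container's main process in .State.Pid (0 when the container is not running), and that the caller has permission to signal that PID.

```shell
# Hypothetical helper (not the merged patch): send a signal to a container's
# main process from the host, without going through "docker kill".
# $1 = container name, $2 = signal name (defaults to HUP).
signal_container_directly() {
    container="$1"
    sig="${2:-HUP}"
    # Host PID of the container's main process; Docker reports 0 when stopped.
    pid="$(docker inspect --format '{{.State.Pid}}' "$container")" || return 1
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "container $container is not running" >&2
        return 1
    fi
    kill -s "$sig" "$pid"
}
```

Usage would be something like signal_container_directly nova_scheduler HUP, run as root on the host. Whether the signal reaches the scheduler itself depends on what runs as the container's main process, which this sketch does not check.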

OpenStack Infra (hudson-openstack) on 2019-10-25

Changed in kolla-ansible:
assignee:	Mark Goddard (mgoddard) → Radosław Piliszek (yoctozepto)

OpenStack Infra (hudson-openstack) wrote on 2019-10-30: Fix merged to kolla-ansible (master)

#14

Reviewed: https://review.opendev.org/690982
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=6bdf202658e08bb9f43ca50334587b05dc4bac03
Submitter: Zuul
Branch: master

commit 6bdf202658e08bb9f43ca50334587b05dc4bac03
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244

Changed in kolla-ansible:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-10-30: Fix proposed to kolla-ansible (stable/stein)

#15

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/692212

OpenStack Infra (hudson-openstack) wrote on 2019-10-30: Fix proposed to kolla-ansible (stable/rocky)

#16

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/692213

OpenStack Infra (hudson-openstack) wrote on 2019-11-05: Fix merged to kolla-ansible (stable/rocky)

#17

Reviewed: https://review.opendev.org/692213
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=d4734437b1d828013384cb20475981def569d6af
Submitter: Zuul
Branch: stable/rocky

commit d4734437b1d828013384cb20475981def569d6af
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244
    (cherry picked from commit 6bdf202658e08bb9f43ca50334587b05dc4bac03)
    (cherry picked from commit 2242fceb73abac54ef46acb41997bf81a77692a3)

OpenStack Infra (hudson-openstack) wrote on 2019-11-05: Fix merged to kolla-ansible (stable/stein)

#18

Reviewed: https://review.opendev.org/692212
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=2242fceb73abac54ef46acb41997bf81a77692a3
Submitter: Zuul
Branch: stable/stein

commit 2242fceb73abac54ef46acb41997bf81a77692a3
Author: Mark Goddard <email address hidden>
Date: Thu Oct 24 15:01:42 2019 +0100

    Fix nova scheduler down after first docker restart

    Due to a Docker bug [1] we cannot use Docker to send
    SIGHUP to the container because it will mark it as
    stopped.
    This patch sends the signal directly to the process,
    bypassing Docker.

    'changed_when: false' is also removed from the
    relevant task as it definitely changes the state.
    In the future we could do the refresh only if
    there really is a need for another one.

    [1] https://github.com/moby/moby/issues/11065

    Change-Id: Ief73bbd24568d6941384ea3330ab45f11aa42d37
    Co-authored-by: Radosław Piliszek <email address hidden>
    Closes-Bug: #1845244
    (cherry picked from commit 6bdf202658e08bb9f43ca50334587b05dc4bac03)

OpenStack Infra (hudson-openstack) wrote on 2019-11-11: Fix included in openstack/kolla-ansible 9.0.0.0rc1

#19

This issue was fixed in the openstack/kolla-ansible 9.0.0.0rc1 release candidate.

Mark Goddard (mgoddard) on 2019-11-15

Changed in kolla-ansible:
milestone:	9.0.0 → none

OpenStack Infra (hudson-openstack) wrote on 2020-01-30: Fix included in openstack/kolla-ansible 7.2.0

#20

This issue was fixed in the openstack/kolla-ansible 7.2.0 release.

OpenStack Infra (hudson-openstack) wrote on 2020-01-30: Fix included in openstack/kolla-ansible 8.1.0

#21

This issue was fixed in the openstack/kolla-ansible 8.1.0 release.

Affects	Status	Importance	Assigned to	Milestone
kolla-ansible	Fix Released	High	Radosław Piliszek
Rocky	Fix Committed	High	Radosław Piliszek	kolla-ansible 7.2.0 "Rocky"
Stein	Fix Committed	High	Radosław Piliszek	kolla-ansible 8.1.0 "Stein"
Train	Fix Released	High	Radosław Piliszek	kolla-ansible 9.0.0 "Train"