Rebooting an HA controller node can take around 20 minutes

Bug #1792701 reported by Michele Baldessari
This bug affects 1 person
Affects   Status         Importance   Assigned to          Milestone
tripleo   Fix Released   Critical     Michele Baldessari

Bug Description

When issuing a normal reboot command on an overcloud node the following stop sequence can take place:
-------------   -----------------------------
| Pacemaker |   | paunch-container-shutdown |
-------------   -----------------------------
        \                   /
           \             /
              \       /
             ----------
             | docker |
             ----------

We currently have two issues:
A) paunch-container-shutdown has no 'After=pacemaker.service' in its service definition.
We want this ordering for two reasons:
A.1) The normal OpenStack services should be able to stop before the DB/rabbit, so that any shutdown actions that need to store state can complete correctly.
A.2) paunch-container-shutdown stops 10 containers concurrently, which creates considerable load on some environments. Doing that while the HA containers are also shutting down is undesirable.

B) If docker plugins are allowed to stop before docker and also before pacemaker, stopping them while pacemaker is shutting down can cause a series of timeouts and a failure to stop containers:
Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms

One of these plugins is 'rhel-push-plugin.service'
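
To put a number on the log excerpt above, here is a minimal Python sketch (not part of the fix) that measures how long after the "Shutting down Pacemaker" notice the first monitor timeout was logged. The helper names are hypothetical, and the year is an assumption taken from the commit date, since the journal lines carry none.

```python
from datetime import datetime

# The three journal lines quoted in the bug description.
LOG_LINES = [
    "Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker",
    "Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out",
    "Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms",
]


def parse_timestamp(line, year=2018):
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' stamp (year is an assumption)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")


def seconds_until_first_timeout(lines):
    """Seconds between the pacemaker shutdown notice and the first 'timed out' line."""
    start = next(parse_timestamp(l) for l in lines if "Shutting down Pacemaker" in l)
    first_timeout = next(parse_timestamp(l) for l in lines if "timed out" in l)
    return (first_timeout - start).total_seconds()


if __name__ == "__main__":
    print(f"{seconds_until_first_timeout(LOG_LINES):.3f} s")
```

For these lines the gap is about 75 seconds, after which the galera monitor operation hits its 20000ms timeout.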

So B) is the root cause for the timeout, but it is quite desirable to fix A) as well.
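
The ordering argument behind A) and B) can be sketched in Python. This is an illustrative model only, not the puppet-tripleo change: it relies on the systemd rule that units with an After= ordering are stopped in the inverse of their start order. The dictionaries and function names are hypothetical, and the edge for B) is simplified to a direct ordering between pacemaker and the plugin, which may not be how the actual fix wires it.

```python
from graphlib import TopologicalSorter

# unit -> units it is ordered After= (it starts after them, so it stops before them)
CURRENT = {
    "pacemaker.service": {"docker.service"},
    "paunch-container-shutdown.service": {"docker.service"},
}

# Assumed edges: fix A as proposed in the report, fix B as a direct ordering edge.
FIXED = {
    "pacemaker.service": {"docker.service", "rhel-push-plugin.service"},
    "paunch-container-shutdown.service": {"docker.service", "pacemaker.service"},
}


def stops_before(after, first, second):
    """True if `first` is (transitively) ordered After= `second`,
    which means `first` is stopped before `second` on shutdown."""
    seen, stack = set(), [first]
    while stack:
        unit = stack.pop()
        for dep in after.get(unit, ()):
            if dep == second:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


def one_stop_sequence(after):
    """One valid stop sequence: a start order (dependencies first), reversed."""
    start_order = list(TopologicalSorter(after).static_order())
    return start_order[::-1]


if __name__ == "__main__":
    paunch, pcmk = "paunch-container-shutdown.service", "pacemaker.service"
    print("current, paunch stops before pacemaker:", stops_before(CURRENT, paunch, pcmk))
    print("fixed,   paunch stops before pacemaker:", stops_before(FIXED, paunch, pcmk))
    print("fixed stop sequence:", one_stop_sequence(FIXED))
```

In the CURRENT model nothing orders paunch-container-shutdown relative to pacemaker, so both may stop at the same time. In the FIXED model the OpenStack containers stop first, then pacemaker, and only then the plugin and docker.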

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/602828

Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :
Changed in tripleo:
milestone: rocky-rc2 → stein-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/602828
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e288dbd8252765020816639b9b53f8212292cfaf
Submitter: Zuul
Branch: master

commit e288dbd8252765020816639b9b53f8212292cfaf
Author: Michele Baldessari <email address hidden>
Date: Sat Sep 15 15:19:26 2018 +0200

    Make sure rhel-plugin-push.service is stopped after pacemaker stops

    When issuing a normal reboot command on an overcloud node the following
    stop sequence can take place:
    -------------   -----------------------------
    | Pacemaker |   | paunch-container-shutdown |
    -------------   -----------------------------
          |                       |
             \                 /
                 \        /
                 ----------
                 | docker |
                 ----------

    If there are docker plugins that are allowed to stop before docker and
    also before pacemaker, it might happen that stopping them down during
    the pacemaker stop will cause a bunch of timeouts and a failure to stop
    containers:
    Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
    Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
    Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms

    One of these plugins is 'rhel-push-plugin.service'. It seems that when
    this plugin is free to stop before docker on shutdown, it is very
    possible that docker commands can start timing out.

    Before:
    Before adding the symlink we would need 15mins to reboot a node and
    we would get a bunch of timeouts on shutdown and some failed actions on
    boot.

    After:
    A reboot will take a reasonable couple of minutes to complete with no
    failed actions at boot and timeouts during shutdown.

    NB: We add the symlink unconditionally as systemd will ignore it if the
    service is not installed.

    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    Closes-Bug: #1792701

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/606848

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/606849

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.openstack.org/606848
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=a6eaab1406d47c8e368130e705ee9f7b59c59b4f
Submitter: Zuul
Branch: stable/rocky

commit a6eaab1406d47c8e368130e705ee9f7b59c59b4f
Author: Michele Baldessari <email address hidden>
Date: Sat Sep 15 15:19:26 2018 +0200

    Make sure rhel-plugin-push.service is stopped after pacemaker stops

    Closes-Bug: #1792701

    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    (cherry picked from commit e288dbd8252765020816639b9b53f8212292cfaf)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/606849
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5f43470da1f80675ac6144136ec8e60f23f9356b
Submitter: Zuul
Branch: stable/queens

commit 5f43470da1f80675ac6144136ec8e60f23f9356b
Author: Michele Baldessari <email address hidden>
Date: Sat Sep 15 15:19:26 2018 +0200

    Make sure rhel-plugin-push.service is stopped after pacemaker stops

    Closes-Bug: #1792701

    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    (cherry picked from commit e288dbd8252765020816639b9b53f8212292cfaf)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.0.0

This issue was fixed in the openstack/puppet-tripleo 10.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.4.0

This issue was fixed in the openstack/puppet-tripleo 8.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.4.0

This issue was fixed in the openstack/puppet-tripleo 9.4.0 release.
