Bug #1792701 “rebooting an HA controller node can take around 20...” : Bugs : tripleo

Michele Baldessari (michele) wrote on 2018-09-15:

#1

A) is fixed via https://review.rdoproject.org/r/16279

OpenStack Infra (hudson-openstack) wrote on 2018-09-15: Fix proposed to puppet-tripleo (master)

#2

Fix proposed to branch: master
Review: https://review.openstack.org/602828

Changed in tripleo:
status:	Triaged → In Progress

Michele Baldessari (michele) wrote on 2018-09-15:

#3

B) is fixed via https://review.openstack.org/602828

Alex Schultz (alex-schultz) on 2018-09-18

Changed in tripleo:
milestone:	rocky-rc2 → stein-1

OpenStack Infra (hudson-openstack) wrote on 2018-09-30: Fix merged to puppet-tripleo (master)

#4

Reviewed:  https://review.openstack.org/602828
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e288dbd8252765020816639b9b53f8212292cfaf
Submitter: Zuul
Branch:    master

commit e288dbd8252765020816639b9b53f8212292cfaf
Author: Michele Baldessari <michele@acksyn.org>
Date:   Sat Sep 15 15:19:26 2018 +0200

Make sure rhel-plugin-push.service is stopped after pacemaker stops
    
    When issuing a normal reboot command on an overcloud node the following
    stop sequence can take place:
    ------------- -----------------------------
    | Pacemaker | | paunch-container-shutdown |
    ------------- -----------------------------
              |     |
               \   /
                \ /
            ----------
            | docker |
            ----------
    
    If there are docker plugins that are allowed to stop before docker and
    also before pacemaker, stopping them while pacemaker is still shutting
    down can cause a series of timeouts and a failure to stop containers:
    Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
    Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
    Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
    
    One of these plugins is 'rhel-push-plugin.service'. It seems that when
    this plugin is free to stop before docker on shutdown, docker commands
    are likely to start timing out.
    
    Before:
    Before adding the symlink, rebooting a node took about 15 minutes, with
    a series of timeouts on shutdown and some failed actions on boot.
    
    After:
    A reboot completes in a couple of minutes, with no failed actions at
    boot and no timeouts during shutdown.
    
    NB: We add the symlink unconditionally as systemd will ignore it if the
    service is not installed.
    
    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    Closes-Bug: #1792701
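
The ordering argument in the commit message above can be sketched as a small dependency model. This is only an illustration, not the puppet-tripleo change itself: the pair lists BASE and FIX and both helper functions are hypothetical names, and the model assumes that each pair simply means "the first unit must be fully stopped before the second one is stopped". It shows that, without the extra constraint, nothing prevents the plugin from stopping while pacemaker is still shutting down.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical model: (first, later) means "first must be fully stopped
# before later is stopped". BASE mirrors the diagram in the commit message.
BASE = [
    ("pacemaker", "docker"),
    ("paunch-container-shutdown", "docker"),
]
# Assumed encoding of the fix: the plugin may only stop after pacemaker.
FIX = [("pacemaker", "rhel-push-plugin")]
UNITS = ["pacemaker", "paunch-container-shutdown", "docker", "rhel-push-plugin"]


def must_stop_before(constraints, first, later):
    """Return True if the constraints force `first` to stop before `later`."""
    successors = {}
    for a, b in constraints:
        successors.setdefault(a, set()).add(b)
    # Depth-first search for a chain first -> ... -> later.
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            if nxt == later:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def one_stop_order(units, constraints):
    """Return one stop order that satisfies every constraint."""
    sorter = TopologicalSorter({unit: set() for unit in units})
    for first, later in constraints:
        sorter.add(later, first)  # `later` waits for `first` to stop
    return list(sorter.static_order())


if __name__ == "__main__":
    # Without the fix the plugin is free to stop while pacemaker is running.
    print(must_stop_before(BASE, "pacemaker", "rhel-push-plugin"))        # False
    # With the fix pacemaker is always stopped first.
    print(must_stop_before(BASE + FIX, "pacemaker", "rhel-push-plugin"))  # True
    print(one_stop_order(UNITS, BASE + FIX))
```

The last line prints one valid order only; several orders can satisfy the same constraints, which is why the check is expressed as "is this ordering forced" rather than by comparing a single computed sequence.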

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-09-30: Fix proposed to puppet-tripleo (stable/rocky)

#5

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/606848

OpenStack Infra (hudson-openstack) wrote on 2018-09-30: Fix proposed to puppet-tripleo (stable/queens)

#6

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/606849

OpenStack Infra (hudson-openstack) wrote on 2018-10-01: Fix merged to puppet-tripleo (stable/rocky)

#7

Reviewed:  https://review.openstack.org/606848
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=a6eaab1406d47c8e368130e705ee9f7b59c59b4f
Submitter: Zuul
Branch:    stable/rocky

commit a6eaab1406d47c8e368130e705ee9f7b59c59b4f
Author: Michele Baldessari <michele@acksyn.org>
Date:   Sat Sep 15 15:19:26 2018 +0200

Make sure rhel-plugin-push.service is stopped after pacemaker stops
    
    When issuing a normal reboot command on an overcloud node the following
    stop sequence can take place:
    ------------- -----------------------------
    | Pacemaker | | paunch-container-shutdown |
    ------------- -----------------------------
              |     |
               \   /
                \ /
            ----------
            | docker |
            ----------
    
    If there are docker plugins that are allowed to stop before docker and
    also before pacemaker, stopping them while pacemaker is still shutting
    down can cause a series of timeouts and a failure to stop containers:
    Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
    Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
    Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
    
    One of these plugins is 'rhel-push-plugin.service'. It seems that when
    this plugin is free to stop before docker on shutdown, docker commands
    are likely to start timing out.
    
    Before:
    Before adding the symlink, rebooting a node took about 15 minutes, with
    a series of timeouts on shutdown and some failed actions on boot.
    
    After:
    A reboot completes in a couple of minutes, with no failed actions at
    boot and no timeouts during shutdown.
    
    NB: We add the symlink unconditionally as systemd will ignore it if the
    service is not installed.
    
    Closes-Bug: #1792701
    
    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    (cherry picked from commit e288dbd8252765020816639b9b53f8212292cfaf)

tags:

added: in-stable-rocky

OpenStack Infra (hudson-openstack) wrote on 2018-10-01: Fix merged to puppet-tripleo (stable/queens)

#8

Reviewed:  https://review.openstack.org/606849
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5f43470da1f80675ac6144136ec8e60f23f9356b
Submitter: Zuul
Branch:    stable/queens

commit 5f43470da1f80675ac6144136ec8e60f23f9356b
Author: Michele Baldessari <michele@acksyn.org>
Date:   Sat Sep 15 15:19:26 2018 +0200

Make sure rhel-plugin-push.service is stopped after pacemaker stops
    
    When issuing a normal reboot command on an overcloud node the following
    stop sequence can take place:
    ------------- -----------------------------
    | Pacemaker | | paunch-container-shutdown |
    ------------- -----------------------------
              |     |
               \   /
                \ /
            ----------
            | docker |
            ----------
    
    If there are docker plugins that are allowed to stop before docker and
    also before pacemaker, stopping them while pacemaker is still shutting
    down can cause a series of timeouts and a failure to stop containers:
    Sep 13 17:53:00.821030 controller-0.localdomain pacemakerd[6147]: notice: Shutting down Pacemaker
    Sep 13 17:54:15.798026 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000 process (PID 226329) timed out
    Sep 13 17:54:15.799004 controller-0.localdomain lrmd[6284]: warning: galera-bundle-docker-0_monitor_60000:226329 - timed out after 20000ms
    
    One of these plugins is 'rhel-push-plugin.service'. It seems that when
    this plugin is free to stop before docker on shutdown, docker commands
    are likely to start timing out.
    
    Before:
    Before adding the symlink, rebooting a node took about 15 minutes, with
    a series of timeouts on shutdown and some failed actions on boot.
    
    After:
    A reboot completes in a couple of minutes, with no failed actions at
    boot and no timeouts during shutdown.
    
    NB: We add the symlink unconditionally as systemd will ignore it if the
    service is not installed.
    
    Closes-Bug: #1792701
    
    Change-Id: I6f6d27f2457efcc49d9edd8a2f98484c5f7c0933
    (cherry picked from commit e288dbd8252765020816639b9b53f8212292cfaf)

tags:

added: in-stable-queens

OpenStack Infra (hudson-openstack) wrote on 2018-10-08: Fix included in openstack/puppet-tripleo 10.0.0

#9

This issue was fixed in the openstack/puppet-tripleo 10.0.0 release.

OpenStack Infra (hudson-openstack) wrote on 2019-02-15: Fix included in openstack/puppet-tripleo 8.4.0

#10

This issue was fixed in the openstack/puppet-tripleo 8.4.0 release.

OpenStack Infra (hudson-openstack) wrote on 2019-03-14: Fix included in openstack/puppet-tripleo 9.4.0

#11

This issue was fixed in the openstack/puppet-tripleo 9.4.0 release.

rebooting an HA controller node can take around 20 minutes
