Reloading compute with SIGHUP prevents instances from booting

Bug #1715374 reported by Sahid Orentino on 2017-09-06
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Ralf Haferkamp
openstack-ansible
Critical
Mohammed Naser
oslo.service
Undecided
Mohammed Naser
tripleo
Critical
Emilien Macchi

Bug Description

When trying to boot a new instance at a compute-node, where nova-compute received SIGHUP(the SIGHUP is used as a trigger for reloading mutable options), it always failed.

  ========== nova/compute/manager.py ==============
    def cancel_all_events(self):
        if self._events is None:
            LOG.debug('Unexpected attempt to cancel events during shutdown.')
            return
        our_events = self._events
        # NOTE(danms): Block new events
        self._events = None <--- Set self._events to "None"
    ...
    =================================================

  This will cause a NovaException when prepare_for_instance_event() was called.
  It's the cause of the failure of network allocation.

    ========== nova/compute/manager.py ==============
    def prepare_for_instance_event(self, instance, event_name):
        ...
        if self._events is None:
            # NOTE(danms): We really should have a more specific error
            # here, but this is what we use for our default error case
            raise exception.NovaException('In shutdown, no new events '
                                          'can be scheduled')
    =================================================

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
tags: added: compute
tags: added: low-hanging-fruit
Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
milestone: none → stein-1
assignee: nobody → Bogdan Dobrelya (bogdando)

Reviewed: https://review.openstack.org/596277
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=305445bac6ccf2cb491f451412035f08aa2928ba
Submitter: Zuul
Branch: master

commit 305445bac6ccf2cb491f451412035f08aa2928ba
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>

wes hayutin (weshayutin) on 2018-09-22
tags: added: alert

Reviewed: https://review.openstack.org/604006
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=fc31a1c649cca06d0340a15a55339757ff002775
Submitter: Zuul
Branch: stable/rocky

commit fc31a1c649cca06d0340a15a55339757ff002775
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 305445bac6ccf2cb491f451412035f08aa2928ba)

tags: added: in-stable-rocky
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm)
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: In Progress → Fix Released
status: Fix Released → Triaged
Emilien Macchi (emilienm) wrote :

(and the backports)

Reviewed: https://review.openstack.org/596275
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Submitter: Zuul
Branch: master

commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:15:51 2018 +0200

    Fix postrotate to notify holders of rotated logs

    Lsof +L1 locates unlinked and open files and does not work for
    logrotate, neither with copyteuncate not w/o that option.

    Instead, find *.X (X - number) files held and notify the processes
    owning those to make an apropriate actions and reopen new log files to
    stop writing to the rotated files.

    The actions to be taken by such processes are:

    * For httpd processes, use USR1 to gracefully reload
    * For neutron-server, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1780139).
    * For nova-compute, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1715374).
    * For other processes, use HUP to reload

    This also fixes the filter to match logfiles ending with *err,
    like rabbitmq startup errors log.

    Closes-Bug: #1780139
    Closes-Bug: #1785659
    Closes-Bug: #1715374

    Change-Id: I5110426aa26e5fce7ebb4d80d8a2082cbf80519c
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: Triaged → Fix Released

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/604708
Reason: not needed any more as we fallback to copytruncate

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601558
Reason: not needed any more as we fallback to copytruncate

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601544
Reason: implementation falls back to copytruncate making this not needed any more

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/606908
Reason: implementation falls back to copytruncate making this not needed any more

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/606864
Reason: implementation falls back to copytruncate making this not needed any more

This issue was fixed in the openstack/puppet-tripleo 10.0.0 release.

We can't fix that for tripleo as it requires nova fixes for SIGHUP processing, so marked it invalid (for tripleo). Logrotate configs had been switched from sending SIGHUP to using copytruncate, so tripleo is not affected with this anymore.

Changed in tripleo:
status: Fix Released → Invalid
Changed in tripleo:
status: Invalid → Won't Fix

Change abandoned by sahid (<email address hidden>) on branch: master
Review: https://review.openstack.org/420026

I've just resubmitted the old patch (with smaller changes to address the feedback) as this but can cause serious issue for upgrades when following the official documentation.

Matt Riedemann (mriedem) on 2019-01-25
Changed in nova:
status: Confirmed → In Progress
assignee: nobody → Ralf Haferkamp (rhafer)
importance: Low → High

Change abandoned by Ralf Haferkamp (<email address hidden>) on branch: master
Review: https://review.openstack.org/633220

Jay Pipes (jaypipes) on 2019-02-15
summary: - Reloading compute with SIGHUP prenvents instances to boot
+ Reloading compute with SIGHUP prevents instances from booting
Mohammed Naser (mnaser) on 2019-03-07
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Mohammed Naser (mnaser)

Fix proposed to branch: master
Review: https://review.openstack.org/641793

Changed in oslo.service:
assignee: nobody → Dan Smith (danms)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/641907

Changed in oslo.service:
assignee: Dan Smith (danms) → Mohammed Naser (mnaser)

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/641793
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.

Reviewed: https://review.openstack.org/641911
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=817c95fdac916f9b736308cc18f884ba61f502fa
Submitter: Zuul
Branch: master

commit 817c95fdac916f9b736308cc18f884ba61f502fa
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500

    nova: restart instead of reloading services

    There is a bug inside oslo.service which causes calls a stop,
    reset and start instead of reset only on SIGHUP which causes
    a very inconsistent state of services.

    Until this is resolved, we should not reload services but restart
    them only. This patch should be reverted once the related bug
    is fixed.

    This patch also removes SUSE jobs from check/gate and into the
    experimental queue as they do not have full maintainers at the
    moment.

    In addition, it moves jobs that have been failing for a long time
    such as the aio_distro_lxc to experimental as well, as they have
    been problematic.

    Change-Id: I61e340a4ef5f0980e8c36611c0f78cd06946636b
    Related-Bug: #1715374

Change abandoned by Guilherme Steinmuller Pimentel (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/642846

Reviewed: https://review.openstack.org/642842
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=02ca9bc374423567e695a075bbfbecef506c86cd
Submitter: Zuul
Branch: stable/rocky

commit 02ca9bc374423567e695a075bbfbecef506c86cd
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500

    nova: restart instead of reloading services

    There is a bug inside oslo.service which causes calls a stop,
    reset and start instead of reset only on SIGHUP which causes
    a very inconsistent state of services.

    Until this is resolved, we should not reload services but restart
    them only. This patch should be reverted once the related bug
    is fixed.

    Depends-On: If53b59f3faede91e158cd544742ab37639800b57

    Change-Id: I61e340a4ef5f0980e8c36611c0f78cd06946636b
    Related-Bug: #1715374
    (cherry picked from commit 817c95fdac916f9b736308cc18f884ba61f502fa)

Matt Riedemann (mriedem) wrote :

I created a train devstack today:

commit 683454f319246c3bda088b5e0325c5673263ce08 (HEAD -> master, origin/master, origin/HEAD)
Merge: 13e260ea 7224a6b5
Author: Zuul <email address hidden>
Date: Wed Apr 3 06:39:52 2019 +0000

    Merge "Update docs index page"

And created a test VM, it was fine. Then I SIGHUP'ed <email address hidden>:

sudo systemctl kill -s HUP <email address hidden>

I watched the n-cpu logs to make sure it was doing the reset() logic as expected, which it did. Then I tried creating another server which failed in privsep calls in the libvirt driver:

http://paste.openstack.org/show/748820/

Then when trying to cleanup and unplug VIFs privsep failed again:

http://paste.openstack.org/show/748821/

I couldn't get past this until I restarted <email address hidden>.

So unless there is some weird configuration with privsep + nova-compute in devstack, we have bigger issues than just temporarily not getting network-vif-plugged callbacks registered, but I don't know if the privsep stuff is a recent regression or not in this flow.

Matt Riedemann (mriedem) wrote :

stack@train:~$ grep privsep /etc/nova/rootwrap.d/compute.filters
# os_brick.privileged.default oslo.privsep context
privsep-rootwrap-os_brick: RegExpFilter, privsep-helper, root, privsep-helper, --config-file, /etc/(?!\.\.).*, --privsep_context, os_brick.privileged.default, --privsep_sock_path, /tmp/.*
privsep-rootwrap-sys_admin: RegExpFilter, privsep-helper, root, privsep-helper, --config-file, /etc/(?!\.\.).*, --privsep_context, nova.privsep.sys_admin_pctxt, --privsep_sock_path, /tmp/.*
stack@train:~$
stack@train:~$ sudo systemctl status <email address hidden>
● <email address hidden> - Devstack <email address hidden>
   Loaded: loaded (/<email address hidden>; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-04-03 17:05:17 UTC; 8min ago
 Main PID: 19990 (nova-compute)
    Tasks: 29 (limit: 4915)
   CGroup: /<email address hidden>
           ├─19990 /usr/bin/python /usr/local/bin/nova-compute --config-file /etc/nova/nova-cpu.conf
           ├─20063 /usr/bin/python /usr/local/bin/privsep-helper --config-file /etc/nova/nova-cpu.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /t
           └─20123 /usr/bin/python /usr/local/bin/privsep-helper --config-file /etc/nova/nova-cpu.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tm

Matt Riedemann (mriedem) wrote :

This is the <email address hidden> status after a full restart and after just a HUP:

http://paste.openstack.org/show/748822/

Note the privsep-helper child processes are gone on the SIGHUP.

And this is the n-cpu logs on the HUP:

http://paste.openstack.org/show/748823/

Note this:

Apr 03 17:15:28 train nova-compute[19990]: DEBUG oslo_privsep.comm [-] EOF on privsep read channel {{(pid=19990) _reader_main /usr/local/lib/python2.7/dist-packages/oslo_privsep/comm.py:145}}

Matt Riedemann (mriedem) wrote :

There isn't much interesting in the unit file either:

stack@train:~$ sudo systemctl cat <email address hidden>
# /<email address hidden>

[Unit]
Description = Devstack <email address hidden>

[Service]
Group = libvirt
ExecReload = /bin/kill -HUP $MAINPID
TimeoutStopSec = 300
KillMode = process
ExecStart = /usr/local/bin/nova-compute --config-file /etc/nova/nova-cpu.conf
User = stack

[Install]
WantedBy = multi-user.target

Matt Riedemann (mriedem) wrote :

FWIW the privsep SIGHUP issue with nova-compute isn't a regression in Stein, I was able to reproduce the issue with a stable/rocky devstack environment.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers