Reloading compute with SIGHUP prevents instances from booting

Bug #1715374 reported by Sahid Orentino
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Ralf Haferkamp
OpenStack-Ansible
Fix Released
Critical
Mohammed Naser
oslo.service
Fix Released
Undecided
Ben Nemec
tripleo
Won't Fix
Critical
Emilien Macchi

Bug Description

When trying to boot a new instance at a compute-node, where nova-compute received SIGHUP(the SIGHUP is used as a trigger for reloading mutable options), it always failed.

  ========== nova/compute/manager.py ==============
    def cancel_all_events(self):
        if self._events is None:
            LOG.debug('Unexpected attempt to cancel events during shutdown.')
            return
        our_events = self._events
        # NOTE(danms): Block new events
        self._events = None <--- Set self._events to "None"
    ...
    =================================================

  This will cause a NovaException when prepare_for_instance_event() was called.
  It's the cause of the failure of network allocation.

    ========== nova/compute/manager.py ==============
    def prepare_for_instance_event(self, instance, event_name):
        ...
        if self._events is None:
            # NOTE(danms): We really should have a more specific error
            # here, but this is what we use for our default error case
            raise exception.NovaException('In shutdown, no new events '
                                          'can be scheduled')
    =================================================

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
tags: added: compute
tags: added: low-hanging-fruit
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :
Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
milestone: none → stein-1
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/601544

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/601558

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/596277
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=305445bac6ccf2cb491f451412035f08aa2928ba
Submitter: Zuul
Branch: master

commit 305445bac6ccf2cb491f451412035f08aa2928ba
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604006

wes hayutin (weshayutin)
tags: added: alert
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/604006
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=fc31a1c649cca06d0340a15a55339757ff002775
Submitter: Zuul
Branch: stable/rocky

commit fc31a1c649cca06d0340a15a55339757ff002775
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200

    Add container runtime packages for cron image

    Logrotate-crond needs docker binary (from package) and docker
    socket to restart some non-SIGHUP friendly containers.

    Related-Bug: #1715374
    Related-Bug: #1780139

    Change-Id: I73e7bb6e5ba4a79a3158a1878bf8b26a539e3e2f
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 305445bac6ccf2cb491f451412035f08aa2928ba)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/604708

Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Reloading compute with SIGHUP prenvents instances to boot
tags: removed: low-hanging-fruit
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm)
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: In Progress → Fix Released
status: Fix Released → Triaged
Revision history for this message
Emilien Macchi (emilienm) wrote :

(and the backports)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/596275
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Submitter: Zuul
Branch: master

commit d37c74d6382690a05ba3ca10edd7f2acb0fbbb2e
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:15:51 2018 +0200

    Fix postrotate to notify holders of rotated logs

    Lsof +L1 locates unlinked and open files and does not work for
    logrotate, neither with copyteuncate not w/o that option.

    Instead, find *.X (X - number) files held and notify the processes
    owning those to make an apropriate actions and reopen new log files to
    stop writing to the rotated files.

    The actions to be taken by such processes are:

    * For httpd processes, use USR1 to gracefully reload
    * For neutron-server, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1780139).
    * For nova-compute, restart the container as it cannot process
      HUP signal well (LP bug #1276694, LP bug #1715374).
    * For other processes, use HUP to reload

    This also fixes the filter to match logfiles ending with *err,
    like rabbitmq startup errors log.

    Closes-Bug: #1780139
    Closes-Bug: #1785659
    Closes-Bug: #1715374

    Change-Id: I5110426aa26e5fce7ebb4d80d8a2082cbf80519c
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/606864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/606908

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/pike)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/604708
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/queens)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601558
Reason: not needed any more as we fallback to copytruncate

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/queens)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/601544
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/pike)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/606908
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/rocky)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/606864
Reason: implementation falls back to copytruncate making this not needed any more

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.0.0

This issue was fixed in the openstack/puppet-tripleo 10.0.0 release.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Reloading compute with SIGHUP prenvents instances to boot

We can't fix that for tripleo as it requires nova fixes for SIGHUP processing, so marked it invalid (for tripleo). Logrotate configs had been switched from sending SIGHUP to using copytruncate, so tripleo is not affected with this anymore.

Changed in tripleo:
status: Fix Released → Invalid
Changed in tripleo:
status: Invalid → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by sahid (<email address hidden>) on branch: master
Review: https://review.openstack.org/420026

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/633220

Revision history for this message
Ralf Haferkamp (rhafer) wrote : Re: Reloading compute with SIGHUP prenvents instances to boot

I've just resubmitted the old patch (with smaller changes to address the feedback) as this but can cause serious issue for upgrades when following the official documentation.

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → In Progress
assignee: nobody → Ralf Haferkamp (rhafer)
importance: Low → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Ralf Haferkamp (<email address hidden>) on branch: master
Review: https://review.openstack.org/633220

Jay Pipes (jaypipes)
summary: - Reloading compute with SIGHUP prenvents instances to boot
+ Reloading compute with SIGHUP prevents instances from booting
Mohammed Naser (mnaser)
Changed in openstack-ansible:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Mohammed Naser (mnaser)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.service (master)

Fix proposed to branch: master
Review: https://review.openstack.org/641793

Changed in oslo.service:
assignee: nobody → Dan Smith (danms)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/641907

Changed in oslo.service:
assignee: Dan Smith (danms) → Mohammed Naser (mnaser)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/641911

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.service (master)

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/641793
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/641911
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=817c95fdac916f9b736308cc18f884ba61f502fa
Submitter: Zuul
Branch: master

commit 817c95fdac916f9b736308cc18f884ba61f502fa
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500

    nova: restart instead of reloading services

    There is a bug inside oslo.service which causes calls a stop,
    reset and start instead of reset only on SIGHUP which causes
    a very inconsistent state of services.

    Until this is resolved, we should not reload services but restart
    them only. This patch should be reverted once the related bug
    is fixed.

    This patch also removes SUSE jobs from check/gate and into the
    experimental queue as they do not have full maintainers at the
    moment.

    In addition, it moves jobs that have been failing for a long time
    such as the aio_distro_lxc to experimental as well, as they have
    been problematic.

    Change-Id: I61e340a4ef5f0980e8c36611c0f78cd06946636b
    Related-Bug: #1715374

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/642842

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/642846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible (stable/queens)

Change abandoned by Guilherme Steinmuller Pimentel (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/642846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (stable/rocky)

Reviewed: https://review.openstack.org/642842
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=02ca9bc374423567e695a075bbfbecef506c86cd
Submitter: Zuul
Branch: stable/rocky

commit 02ca9bc374423567e695a075bbfbecef506c86cd
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500

    nova: restart instead of reloading services

    There is a bug inside oslo.service which causes calls a stop,
    reset and start instead of reset only on SIGHUP which causes
    a very inconsistent state of services.

    Until this is resolved, we should not reload services but restart
    them only. This patch should be reverted once the related bug
    is fixed.

    Depends-On: If53b59f3faede91e158cd544742ab37639800b57

    Change-Id: I61e340a4ef5f0980e8c36611c0f78cd06946636b
    Related-Bug: #1715374
    (cherry picked from commit 817c95fdac916f9b736308cc18f884ba61f502fa)

Revision history for this message
Matt Riedemann (mriedem) wrote :

I created a train devstack today:

commit 683454f319246c3bda088b5e0325c5673263ce08 (HEAD -> master, origin/master, origin/HEAD)
Merge: 13e260ea 7224a6b5
Author: Zuul <email address hidden>
Date: Wed Apr 3 06:39:52 2019 +0000

    Merge "Update docs index page"

And created a test VM, it was fine. Then I SIGHUP'ed <email address hidden>:

sudo systemctl kill -s HUP <email address hidden>

I watched the n-cpu logs to make sure it was doing the reset() logic as expected, which it did. Then I tried creating another server which failed in privsep calls in the libvirt driver:

http://paste.openstack.org/show/748820/

Then when trying to cleanup and unplug VIFs privsep failed again:

http://paste.openstack.org/show/748821/

I couldn't get past this until I restarted <email address hidden>.

So unless there is some weird configuration with privsep + nova-compute in devstack, we have bigger issues than just temporarily not getting network-vif-plugged callbacks registered, but I don't know if the privsep stuff is a recent regression or not in this flow.

Revision history for this message
Matt Riedemann (mriedem) wrote :

stack@train:~$ grep privsep /etc/nova/rootwrap.d/compute.filters
# os_brick.privileged.default oslo.privsep context
privsep-rootwrap-os_brick: RegExpFilter, privsep-helper, root, privsep-helper, --config-file, /etc/(?!\.\.).*, --privsep_context, os_brick.privileged.default, --privsep_sock_path, /tmp/.*
privsep-rootwrap-sys_admin: RegExpFilter, privsep-helper, root, privsep-helper, --config-file, /etc/(?!\.\.).*, --privsep_context, nova.privsep.sys_admin_pctxt, --privsep_sock_path, /tmp/.*
stack@train:~$
stack@train:~$ sudo systemctl status <email address hidden>
● <email address hidden> - Devstack <email address hidden>
   Loaded: loaded (/<email address hidden>; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-04-03 17:05:17 UTC; 8min ago
 Main PID: 19990 (nova-compute)
    Tasks: 29 (limit: 4915)
   CGroup: /<email address hidden>
           ├─19990 /usr/bin/python /usr/local/bin/nova-compute --config-file /etc/nova/nova-cpu.conf
           ├─20063 /usr/bin/python /usr/local/bin/privsep-helper --config-file /etc/nova/nova-cpu.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /t
           └─20123 /usr/bin/python /usr/local/bin/privsep-helper --config-file /etc/nova/nova-cpu.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tm

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is the <email address hidden> status after a full restart and after just a HUP:

http://paste.openstack.org/show/748822/

Note the privsep-helper child processes are gone on the SIGHUP.

And this is the n-cpu logs on the HUP:

http://paste.openstack.org/show/748823/

Note this:

Apr 03 17:15:28 train nova-compute[19990]: DEBUG oslo_privsep.comm [-] EOF on privsep read channel {{(pid=19990) _reader_main /usr/local/lib/python2.7/dist-packages/oslo_privsep/comm.py:145}}

Revision history for this message
Matt Riedemann (mriedem) wrote :

There isn't much interesting in the unit file either:

stack@train:~$ sudo systemctl cat <email address hidden>
# /<email address hidden>

[Unit]
Description = Devstack <email address hidden>

[Service]
Group = libvirt
ExecReload = /bin/kill -HUP $MAINPID
TimeoutStopSec = 300
KillMode = process
ExecStart = /usr/local/bin/nova-compute --config-file /etc/nova/nova-cpu.conf
User = stack

[Install]
WantedBy = multi-user.target

Revision history for this message
Matt Riedemann (mriedem) wrote :

FWIW the privsep SIGHUP issue with nova-compute isn't a regression in Stein, I was able to reproduce the issue with a stable/rocky devstack environment.

Changed in oslo.service:
assignee: Mohammed Naser (mnaser) → Ben Nemec (bnemec)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.privsep 1.33.2

This issue was fixed in the openstack/oslo.privsep 1.33.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/679974

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/679974
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=73f1fda7e93325d3ca2b2c62afb17fe011d650f3
Submitter: Zuul
Branch: master

commit 73f1fda7e93325d3ca2b2c62afb17fe011d650f3
Author: Eric Fried <email address hidden>
Date: Wed Sep 4 06:47:48 2019 -0500

    Bump min for oslo.service & .privsep to fix SIGHUP

    The combined fixes for the two related bugs resolve the problem where
    SIGHUP breaks the nova-compute service. Bump the minimum requirements
    for oslo.privsep and oslo.service to make sure these fixes are in place,
    and add a reno to advertise resolution of the issue.

    This also bumps oslo.utils to match the lower constraint from
    oslo.service.

    Change-Id: I39ead744b21a4423352a88573f327273e4d09630
    Related-Bug: #1794708
    Related-Bug: #1715374

Revision history for this message
sean mooney (sean-k-mooney) wrote :

as of https://github.com/openstack/nova/commit/73f1fda7e93325d3ca2b2c62afb17fe011d650f3
this is now fix in nova 20.0.0.0rc1

the osls.service issue was fixed in release 1.40.1

Changed in nova:
status: In Progress → Fix Released
Changed in oslo.service:
status: In Progress → Fix Released
pkrev (pkrev-design)
Changed in openstack-ansible:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.privsep 1.32.2

This issue was fixed in the openstack/oslo.privsep 1.32.2 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.