Reloading compute with SIGHUP prevents instances from booting
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | | High | Ralf Haferkamp | |
| openstack-ansible | | Critical | Mohammed Naser | |
| oslo.service | | Undecided | Ben Nemec | |
| tripleo | | Critical | Emilien Macchi | |
Bug Description
Booting a new instance on a compute node always fails once nova-compute on that node has received SIGHUP (SIGHUP is used as the trigger for reloading mutable options).
========== nova/compute/
def cancel_
if self._events is None:
return
our_events = self._events
# NOTE(danms): Block new events
...
===
This will cause a NovaException when prepare_
It's the cause of the failure of network allocation.
========== nova/compute/
def prepare_
...
if self._events is None:
# NOTE(danms): We really should have a more specific error
# here, but this is what we use for our default error case
raise exception.
===
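The failure mode in the two excerpts above can be sketched in isolation. This is a minimal illustration, not nova's code: the class name `ToyInstanceEvents`, the method names, and the use of `RuntimeError` are hypothetical stand-ins. The only behaviour taken from the excerpts is that cancellation leaves `_events` set to `None` and that preparing a new event raises when `_events` is `None`.

```python
class ToyInstanceEvents:
    """Hypothetical stand-in for the event registry in the excerpts above."""

    def __init__(self):
        # Maps an instance id to the set of event names being waited for.
        self._events = {}

    def prepare_for_event(self, instance_id, event_name):
        if self._events is None:
            # Mirrors the generic error path quoted in the bug description.
            raise RuntimeError("event registry is shut down")
        self._events.setdefault(instance_id, set()).add(event_name)
        return event_name

    def cancel_all_events(self):
        if self._events is None:
            return
        # Block new events. Nothing in this sketch restores the dict,
        # so every later prepare_for_event() call fails.
        self._events = None


def boot_after_reload():
    """Simulate a reload that cancels events, followed by a new boot."""
    registry = ToyInstanceEvents()
    registry.prepare_for_event("vm-1", "network-vif-plugged")  # fine before reload
    registry.cancel_all_events()  # assumed to be reached on the reload path
    try:
        registry.prepare_for_event("vm-2", "network-vif-plugged")
    except RuntimeError as exc:
        return f"boot failed: {exc}"
    return "boot ok"


print(boot_after_reload())
```

Once the registry has been set to `None` and the process keeps running, only something that recreates the registry (here, constructing a new `ToyInstanceEvents`) would make later boots succeed, which is consistent with a full service restart clearing the problem.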
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → Low |
tags: | added: compute |
tags: | added: low-hanging-fruit |
Sahid Orentino (sahid-ferdjaoui) wrote : | #1 |
Changed in tripleo: | |
status: | New → In Progress |
importance: | Undecided → Critical |
milestone: | none → stein-1 |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
Fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/queens) | #3 |
Related fix proposed to branch: stable/queens
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 305445bac6ccf2c
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200
Add container runtime packages for cron image
Logrotate-crond needs docker binary (from package) and docker
socket to restart some non-SIGHUP friendly containers.
Related-Bug: #1715374
Related-Bug: #1780139
Change-Id: I73e7bb6e5ba4a7
Signed-off-by: Bogdan Dobrelya <email address hidden>
Related fix proposed to branch: stable/rocky
Review: https:/
tags: | added: alert |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit fc31a1c649cca06
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:42:49 2018 +0200
Add container runtime packages for cron image
Logrotate-crond needs docker binary (from package) and docker
socket to restart some non-SIGHUP friendly containers.
Related-Bug: #1715374
Related-Bug: #1780139
Change-Id: I73e7bb6e5ba4a7
Signed-off-by: Bogdan Dobrelya <email address hidden>
(cherry picked from commit 305445bac6ccf2c
tags: | added: in-stable-rocky |
Related fix proposed to branch: stable/pike
Review: https:/
Remaining patches to close this bug:
* https:/
* https:/
tags: | removed: low-hanging-fruit |
Changed in tripleo: | |
assignee: | Bogdan Dobrelya (bogdando) → Emilien Macchi (emilienm) |
Emilien Macchi (emilienm) wrote : | #9 |
remaining patch : https:/
Changed in tripleo: | |
status: | In Progress → Fix Released |
status: | Fix Released → Triaged |
Emilien Macchi (emilienm) wrote : | #10 |
(and the backports)
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit d37c74d6382690a
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Aug 24 12:15:51 2018 +0200
Fix postrotate to notify holders of rotated logs
Lsof +L1 locates unlinked and open files and does not work for
logrotate, neither with copytruncate nor w/o that option.
Instead, find *.X (X - number) files held and notify the processes
owning those to take appropriate actions and reopen new log files to
stop writing to the rotated files.
The actions to be taken by such processes are:
* For httpd processes, use USR1 to gracefully reload
* For neutron-server, restart the container as it cannot process
HUP signal well (LP bug #1276694, LP bug #1780139).
* For nova-compute, restart the container as it cannot process
HUP signal well (LP bug #1276694, LP bug #1715374).
* For other processes, use HUP to reload
This also fixes the filter to match logfiles ending with *err,
like rabbitmq startup errors log.
Closes-Bug: #1780139
Closes-Bug: #1785659
Closes-Bug: #1715374
Change-Id: I5110426aa26e5f
Signed-off-by: Bogdan Dobrelya <email address hidden>
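The per-process actions listed in the commit message above could be expressed roughly as follows. This is only a sketch of the decision table: the process names and the signals come from the commit message, while the function names, the returned action strings, and the regular expression for "*.X" files are assumptions made for illustration and are not the actual postrotate script.

```python
import re

# Matches rotated log names of the form "*.X" where X is a number,
# for example "/var/log/foo.log.1" (pattern is an assumption).
ROTATED_LOG = re.compile(r"\.\d+$")

# Processes that, per the commit message, cannot process HUP well.
RESTART_INSTEAD_OF_HUP = ("neutron-server", "nova-compute")


def is_rotated(path):
    """Return True if the path looks like a rotated log file."""
    return bool(ROTATED_LOG.search(path))


def action_for(process_name):
    """Pick how to notify a process that still holds a rotated log file."""
    if process_name == "httpd":
        return "signal USR1"        # graceful reload
    if process_name in RESTART_INSTEAD_OF_HUP:
        return "restart container"  # HUP is not handled well by these
    return "signal HUP"             # default: ask the process to reload


for name in ("httpd", "neutron-server", "nova-compute", "rabbitmq-server"):
    print(name, "->", action_for(name))
```

`rabbitmq-server` is used here only as an example of a process that falls through to the default branch.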
Changed in tripleo: | |
status: | Triaged → Fix Released |
Fix proposed to branch: stable/rocky
Review: https:/
Fix proposed to branch: stable/pike
Review: https:/
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https:/
Reason: not needed any more as we fallback to copytruncate
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https:/
Reason: not needed any more as we fallback to copytruncate
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/queens
Review: https:/
Reason: implementation falls back to copytruncate making this not needed any more
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/pike
Review: https:/
Reason: implementation falls back to copytruncate making this not needed any more
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/rocky
Review: https:/
Reason: implementation falls back to copytruncate making this not needed any more
This issue was fixed in the openstack/
Bogdan Dobrelya (bogdando) wrote : Re: Reloading compute with SIGHUP prenvents instances to boot | #20 |
We can't fix that for tripleo as it requires nova fixes for SIGHUP processing, so I marked it invalid (for tripleo). Logrotate configs have been switched from sending SIGHUP to using copytruncate, so tripleo is not affected by this anymore.
Changed in tripleo: | |
status: | Fix Released → Invalid |
Changed in tripleo: | |
status: | Invalid → Won't Fix |
Change abandoned by sahid (<email address hidden>) on branch: master
Review: https:/
Fix proposed to branch: master
Review: https:/
I've just resubmitted the old patch (with smaller changes to address the feedback), as this bug can cause serious issues for upgrades when following the official documentation.
Changed in nova: | |
status: | Confirmed → In Progress |
assignee: | nobody → Ralf Haferkamp (rhafer) |
importance: | Low → High |
Change abandoned by Ralf Haferkamp (<email address hidden>) on branch: master
Review: https:/
summary: |
- Reloading compute with SIGHUP prenvents instances to boot
+ Reloading compute with SIGHUP prevents instances from booting
Changed in openstack-ansible: | |
status: | New → Confirmed |
importance: | Undecided → Critical |
assignee: | nobody → Mohammed Naser (mnaser) |
Fix proposed to branch: master
Review: https:/
Changed in oslo.service: | |
assignee: | nobody → Dan Smith (danms) |
status: | New → In Progress |
Fix proposed to branch: master
Review: https:/
Changed in oslo.service: | |
assignee: | Dan Smith (danms) → Mohammed Naser (mnaser) |
Related fix proposed to branch: master
Review: https:/
Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https:/
Reason: I still think the is-daemon check (and original justification) is wrong, but it's not the acute problem right now so I don't care enough to push on this.
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 817c95fdac916f9
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500
nova: restart instead of reloading services
There is a bug inside oslo.service which causes it to call stop,
reset and start instead of reset only on SIGHUP, which leaves
services in a very inconsistent state.
Until this is resolved, we should not reload services but restart
them only. This patch should be reverted once the related bug
is fixed.
This patch also removes SUSE jobs from check/gate and into the
experimental queue as they do not have full maintainers at the
moment.
In addition, it moves jobs that have been failing for a long time
such as the aio_distro_lxc to experimental as well, as they have
been problematic.
Change-Id: I61e340a4ef5f09
Related-Bug: #1715374
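The difference between "reset only" and "stop, reset and start" described in the commit message above can be illustrated with a toy service. Everything here is hypothetical: `ToyService`, its `registry` attribute, and the two handler functions are not oslo.service APIs. The sketch also assumes that some state is created once at construction time and is not rebuilt by `start()`, which is the kind of state a stop/start cycle can leave broken.

```python
class ToyService:
    """Hypothetical service that records its lifecycle calls."""

    def __init__(self):
        self.calls = []
        # Created once; this sketch assumes start() does not recreate it.
        self.registry = {}

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")
        self.registry = None  # torn down as if the process were exiting

    def reset(self):
        self.calls.append("reset")  # e.g. re-read mutable options


def on_sighup_reset_only(service):
    """Reload behaviour the commit message describes as the intended one."""
    service.reset()


def on_sighup_full_cycle(service):
    """Reload behaviour the commit message describes as the bug."""
    service.stop()
    service.reset()
    service.start()


reset_only = ToyService()
reset_only.start()
on_sighup_reset_only(reset_only)

full_cycle = ToyService()
full_cycle.start()
on_sighup_full_cycle(full_cycle)

print(reset_only.calls, reset_only.registry is not None)
print(full_cycle.calls, full_cycle.registry is not None)
```

In the full-cycle case the service reports itself as started again, but the state that `stop()` tore down is still missing, so the process is running in a state it was never designed for.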
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (stable/rocky) | #30 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to openstack-ansible (stable/queens) | #31 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible (stable/queens) | #32 |
Change abandoned by Guilherme Steinmuller Pimentel (<email address hidden>) on branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to openstack-ansible (stable/rocky) | #33 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 02ca9bc37442356
Author: Mohammed Naser <email address hidden>
Date: Thu Mar 7 22:26:42 2019 -0500
nova: restart instead of reloading services
There is a bug inside oslo.service which causes it to call stop,
reset and start instead of reset only on SIGHUP, which leaves
services in a very inconsistent state.
Until this is resolved, we should not reload services but restart
them only. This patch should be reverted once the related bug
is fixed.
Depends-On: If53b59f3faede9
Change-Id: I61e340a4ef5f09
Related-Bug: #1715374
(cherry picked from commit 817c95fdac916f9
Matt Riedemann (mriedem) wrote : | #34 |
I created a train devstack today:
commit 683454f319246c3
Merge: 13e260ea 7224a6b5
Author: Zuul <email address hidden>
Date: Wed Apr 3 06:39:52 2019 +0000
Merge "Update docs index page"
And created a test VM, it was fine. Then I SIGHUP'ed <email address hidden>:
sudo systemctl kill -s HUP <email address hidden>
I watched the n-cpu logs to make sure it was doing the reset() logic as expected, which it did. Then I tried creating another server which failed in privsep calls in the libvirt driver:
http://
Then when trying to cleanup and unplug VIFs privsep failed again:
http://
I couldn't get past this until I restarted <email address hidden>.
So unless there is some weird configuration with privsep + nova-compute in devstack, we have bigger issues than just temporarily not getting network-vif-plugged callbacks registered, but I don't know if the privsep stuff is a recent regression or not in this flow.
Matt Riedemann (mriedem) wrote : | #35 |
stack@train:~$ grep privsep /etc/nova/
# os_brick.
privsep-
privsep-
stack@train:~$
stack@train:~$ sudo systemctl status <email address hidden>
● <email address hidden> - Devstack <email address hidden>
Loaded: loaded (/<email address hidden>; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-04-03 17:05:17 UTC; 8min ago
Main PID: 19990 (nova-compute)
Tasks: 29 (limit: 4915)
CGroup: /<email address hidden>
├─19990 /usr/bin/python /usr/local/
├─20063 /usr/bin/python /usr/local/
└─20123 /usr/bin/python /usr/local/
Matt Riedemann (mriedem) wrote : | #36 |
This is the <email address hidden> status after a full restart and after just a HUP:
http://
Note the privsep-helper child processes are gone on the SIGHUP.
And this is the n-cpu logs on the HUP:
http://
Note this:
Apr 03 17:15:28 train nova-compute[
Matt Riedemann (mriedem) wrote : | #37 |
There isn't much interesting in the unit file either:
stack@train:~$ sudo systemctl cat <email address hidden>
# /<email address hidden>
[Unit]
Description = Devstack <email address hidden>
[Service]
Group = libvirt
ExecReload = /bin/kill -HUP $MAINPID
TimeoutStopSec = 300
KillMode = process
ExecStart = /usr/local/
User = stack
[Install]
WantedBy = multi-user.target
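For readers unfamiliar with what the `ExecReload` line amounts to: it delivers SIGHUP to the main process, and the process is expected to handle it without exiting. A minimal POSIX-only sketch using Python's standard `signal` module is below. The handler and the counter are hypothetical and stand in for whatever reload logic a real service runs; the script sends the signal to itself purely so that it is self-contained.

```python
import os
import signal
import time

reload_count = 0


def handle_sighup(signum, frame):
    """Record a reload request; a real service would re-read config here."""
    global reload_count
    reload_count += 1


def send_hup_to_self(timeout=1.0):
    """Send SIGHUP to this process and wait briefly for the handler to run."""
    before = reload_count
    os.kill(os.getpid(), signal.SIGHUP)  # same effect as `kill -HUP <pid>`
    deadline = time.monotonic() + timeout
    while reload_count == before and time.monotonic() < deadline:
        time.sleep(0.01)
    return reload_count - before


# Must be installed from the main thread; without a handler the default
# action for SIGHUP terminates the process.
signal.signal(signal.SIGHUP, handle_sighup)
print("reloads handled:", send_hup_to_self())
```

The handler only touches state inside the process that received the signal. Helper child processes are not notified by it, so whether they survive a reload depends on how the parent's reload path treats them.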
Matt Riedemann (mriedem) wrote : | #38 |
FWIW the privsep SIGHUP issue with nova-compute isn't a regression in Stein, I was able to reproduce the issue with a stable/rocky devstack environment.
Changed in oslo.service: | |
assignee: | Mohammed Naser (mnaser) → Ben Nemec (bnemec) |
This issue was fixed in the openstack/
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 73f1fda7e93325d
Author: Eric Fried <email address hidden>
Date: Wed Sep 4 06:47:48 2019 -0500
Bump min for oslo.service & .privsep to fix SIGHUP
The combined fixes for the two related bugs resolve the problem where
SIGHUP breaks the nova-compute service. Bump the minimum requirements
for oslo.privsep and oslo.service to make sure these fixes are in place,
and add a reno to advertise resolution of the issue.
This also bumps oslo.utils to match the lower constraint from
oslo.service.
Change-Id: I39ead744b21a44
Related-Bug: #1794708
Related-Bug: #1715374
sean mooney (sean-k-mooney) wrote : | #42 |
as of https:/
this is now fixed in nova 20.0.0.0rc1
the oslo.service issue was fixed in release 1.40.1
Changed in nova: | |
status: | In Progress → Fix Released |
Changed in oslo.service: | |
status: | In Progress → Fix Released |
Changed in openstack-ansible: | |
status: | Confirmed → Fix Released |
This issue was fixed in the openstack/
https://review.openstack.org/#/c/420026/