deferred restarts not effective on focal-ussuri during system-wide package upgrades

Bug #1955498 reported by Drew Freiberger
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
OpenStack Neutron Gateway Charm | Fix Released | Critical | Liam Young |
OpenStack Neutron Open vSwitch Charm | Fix Released | Critical | Liam Young |

Bug Description

This week, while performing system package upgrades to the latest focal-ussuri packages (nova-compute has openstack-origin=distro on series focal) with "enable-auto-restarts=false" set, we experienced overcloud networking outages resulting from openvswitch switch, db, and agent restarts.

The following packages were updated:

https://pastebin.ubuntu.com/p/6tW2X6yp8K/

The versions of the resultant nova/neutron/openvswitch packages are:

https://pastebin.ubuntu.com/p/DskMBDQRCx/

The /var/log/apt/term.log does list a deferred restart for the openvswitch-switch service resulting from the policy-rc.d file, but still many services were restarted one second before the deferred restart was logged.

I cannot find any reason this should have restarted, unless there was a service restart that triggered a cascade in systemd that didn't get passed through invoke-rc.d/policy-rc.d.

I've captured an sos report and will be filing it with Canonical Support to investigate further.

Tags: seg sts
summary: - deferred restarts not effective on focal-ussuri during openvswitch
- package upgrade
+ deferred restarts not effective on focal-ussuri during system-wide
+ package upgrades
Drew Freiberger (afreiberger) wrote :

Note that the charmed unit status does properly denote:

Unit is ready. Services queued for restart: openvswitch-switch

Drew Freiberger (afreiberger) wrote :

$ more charm-neutron-openvswitch-1c9d98e6-61e0-11ec-9461-2f2de9716d4b.deferred
action: restart
policy_requestor_name: neutron-openvswitch
policy_requestor_type: charm
reason: Package update
service: openvswitch-switch
timestamp: 1640037572

* this timestamp translates to 2021-12-20T21:59:32

But we have a service restart at 21:59:31 per systemd:

$ systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
     Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2021-12-20 21:59:31 UTC; 16min ago
   Main PID: 51024 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 314572)
     Memory: 0B
     CGroup: /system.slice/openvswitch-switch.service

Dec 20 21:59:31 hostname systemd[1]: Starting Open vSwitch...
Dec 20 21:59:31 hostname systemd[1]: Finished Open vSwitch.
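
The one-second gap between the systemd restart and the deferred-restart record can be checked directly. This is a minimal sketch using only the two values quoted above; the variable names are illustrative and not taken from any charm code.

```python
from datetime import datetime, timezone

# Epoch timestamp recorded in the .deferred file quoted above.
deferred_epoch = 1640037572
# "Active: ... since" time reported by systemctl status, also quoted above.
systemd_active_since = "2021-12-20 21:59:31 UTC"

# Interpret the epoch value as UTC so it is comparable with the systemd time.
deferred_at = datetime.fromtimestamp(deferred_epoch, tz=timezone.utc)
restarted_at = datetime.strptime(
    systemd_active_since, "%Y-%m-%d %H:%M:%S UTC"
).replace(tzinfo=timezone.utc)

# Positive value: the service restarted before the deferral was recorded.
gap_seconds = (deferred_at - restarted_at).total_seconds()

print(deferred_at.isoformat())
print(gap_seconds)
```

A positive gap means the service had already been restarted by the time the deferral was written, which matches the observation in this comment.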

Before the restart, ovs-vswitchd.log also shows all interfaces and bridges being removed from the switch, which is why the outage lasts longer than the few seconds a switch restart takes. This suggests that something more than a service restart is happening during these Landscape system patches.

Drew Freiberger (afreiberger) wrote :

Preliminary analysis points to a possible issue with netplan-ovs-cleanup being run, causing the ovs bridges related to netplan changes to be taken down.

Brett Milford (brettmilford) wrote :

So I've been able to rule out involvement of netplan by replicating this issue more directly.

1) deploy focal-ussuri cloud
2) juju config neutron-openvswitch enable-auto-restarts=false
3) observe policy file created
root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# cat /etc/policy-rc.d/charm-neutron-openvswitch.policy
# Managed by juju
blocked_actions:
  neutron-dhcp-agent:
  - restart
  - stop
  - try-restart
  neutron-metadata-agent:
  - restart
  - stop
  - try-restart
  neutron-openvswitch-agent:
  - restart
  - stop
  - try-restart
  openvswitch-switch:
  - restart
  - stop
  - try-restart
  ovs-vswitchd:
  - restart
  - stop
  - try-restart
  ovs-vswitchd-dpdk:
  - restart
  - stop
  - try-restart
  ovsdb-server:
  - restart
  - stop
  - try-restart
policy_requestor_name: neutron-openvswitch
policy_requestor_type: charm

4) increment and upload openvswitch package to ppa (https://launchpad.net/~brettmilford/+archive/ubuntu/lp1955498-openvswitch-focal)
5) install ppa on nova-compute unit
root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# sudo add-apt-repository ppa:brettmilford/lp1955498-openvswitch-focal
6) upgrade openvswitch
root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# apt-get upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  openvswitch-common openvswitch-switch python3-openvswitch
3 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 2855 kB of archives.
After this operation, 0 B of additional disk space will be used.
...
7) observe deferred file created
root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# cat /var/lib/policy-rc.d/charm-neutron-openvswitch-5e5ab802-78c6-11ec-8007-2b2c461941da.deferred
action: restart
policy_requestor_name: neutron-openvswitch
policy_requestor_type: charm
reason: Package update
service: openvswitch-switch
timestamp: 1642555392

root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# date -d @1642555392
Wed Jan 19 01:23:12 UTC 2022

8) observe openvswitch service restarted
root@juju-b8a8c1-ovs-restart-10:/home/ubuntu# systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
     Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-01-19 01:23:11 UTC; 14min ago
   Main PID: 102920 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 4682)
     Memory: 0B
     CGroup: /system.slice/openvswitch-switch.service

Jan 19 01:23:11 juju-b8a8c1-ovs-restart-10 systemd[1]: Starting Open vSwitch...
Jan 19 01:23:11 juju-b8a8c1-ovs-restart-10 systemd[1]: Finished Open vSwitch.

tags: added: seg sts
Changed in charm-neutron-openvswitch:
status: New → Confirmed
Changed in charm-neutron-openvswitch:
status: Confirmed → Triaged
importance: Undecided → Critical
Billy Olsen (billy-olsen) wrote :

Thanks to the steps from @brettmilford (and the ppa), I was able to recreate this and confirm the issue.

Following an upgrade of the openvswitch package - I see this in the /var/log/policy-rc.d.log file:

2022-01-19 02:31:53,977 Permitting open-vm-tools restart
2022-01-19 02:31:55,202 Permitting open-vm-tools.service restart
2022-01-19 02:31:55,983 Permitting vgauth.service restart
2022-01-19 02:31:56,726 Permitting ovs-record-hostname.service restart
2022-01-19 02:31:58,020 restart of openvswitch-switch blocked by charm neutron-openvswitch

and I can see the openvswitch-switch service was restarted at 02:31:57

root@juju-c49180-zaza-e93a047791b2-8:/home/ubuntu# systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
     Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-01-19 02:31:57 UTC; 8min ago
   Main PID: 66522 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 4677)
     Memory: 0B
     CGroup: /system.slice/openvswitch-switch.service

Jan 19 02:31:57 juju-c49180-zaza-e93a047791b2-8 systemd[1]: Starting Open vSwitch...
Jan 19 02:31:57 juju-c49180-zaza-e93a047791b2-8 systemd[1]: Finished Open vSwitch.
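
The log excerpt above can be triaged mechanically by separating permitted restarts from blocked ones. The following is a rough sketch; the two line formats are inferred only from the excerpt in this comment and are an assumption, not a documented format of policy-rc.d.log.

```python
import re

# Excerpt copied from the policy-rc.d.log lines quoted above.
LOG = """\
2022-01-19 02:31:53,977 Permitting open-vm-tools restart
2022-01-19 02:31:55,202 Permitting open-vm-tools.service restart
2022-01-19 02:31:56,726 Permitting vgauth.service restart
2022-01-19 02:31:56,726 Permitting ovs-record-hostname.service restart
2022-01-19 02:31:58,020 restart of openvswitch-switch blocked by charm neutron-openvswitch
"""

# Assumed formats: "<date> <time> Permitting <service> <action>" and
# "<date> <time> <action> of <service> blocked by <type> <name>".
PERMIT = re.compile(r"^(\S+ \S+) Permitting (\S+) (\S+)$")
BLOCK = re.compile(r"^(\S+ \S+) (\S+) of (\S+) blocked by (\S+) (\S+)$")


def classify(log_text):
    """Return (permitted, blocked) lists of service names, in log order."""
    permitted, blocked = [], []
    for line in log_text.splitlines():
        match = PERMIT.match(line)
        if match:
            permitted.append(match.group(2))
            continue
        match = BLOCK.match(line)
        if match:
            blocked.append(match.group(3))
    return permitted, blocked


permitted, blocked = classify(LOG)
# Open vSwitch related units that were allowed to restart despite the policy.
ovs_permitted = [name for name in permitted if name.startswith("ovs")]
print(ovs_permitted)
print(blocked)
```

Any Open vSwitch unit that shows up in the permitted list is a candidate for having triggered the restart that the policy was meant to defer.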

On my next unit, I added the ovs-record-hostname service to the policy file using the following stanza:

  ovs-record-hostname:
  - restart
  - stop
  - try-restart

Then I added @brettmilford's ppa and upgraded the openvswitch package. In the policy-rc.d.log file I observe the ovs-record-hostname.service was blocked by the charm:

2022-01-19 02:35:25,470 Permitting open-vm-tools restart
2022-01-19 02:35:27,018 Permitting open-vm-tools.service restart
2022-01-19 02:35:28,024 Permitting vgauth.service restart
2022-01-19 02:35:29,027 restart of ovs-record-hostname.service blocked by charm neutron-openvswitch
2022-01-19 02:35:29,690 restart of openvswitch-switch blocked by charm neutron-openvswitch

And the service has not been restarted:

root@juju-c49180-zaza-e93a047791b2-9:/etc/policy-rc.d# systemctl status openvswitch-switch
● openvswitch-switch.service - Open vSwitch
     Loaded: loaded (/lib/systemd/system/openvswitch-switch.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2022-01-19 02:28:31 UTC; 15min ago
   Main PID: 62372 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 4677)
     Memory: 0B
     CGroup: /system.slice/openvswitch-switch.service

Jan 19 02:28:31 juju-c49180-zaza-e93a047791b2-9 systemd[1]: Starting Open vSwitch...
Jan 19 02:28:31 juju-c49180-zaza-e93a047791b2-9 systemd[1]: Finished Open vSwitch.
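
The difference between the two units comes down to whether ovs-record-hostname has an entry under blocked_actions. The sketch below models that lookup; `is_blocked` is a hypothetical helper, not the actual policy-rc.d script shipped by the charm, and stripping a trailing ".service" is an assumption based on the log showing both suffixed and unsuffixed names.

```python
BLOCKED = ["restart", "stop", "try-restart"]

# Abbreviated version of the policy file shown earlier in this bug.
original_policy = {
    "blocked_actions": {
        "openvswitch-switch": list(BLOCKED),
        "ovs-vswitchd": list(BLOCKED),
        "ovsdb-server": list(BLOCKED),
    },
    "policy_requestor_name": "neutron-openvswitch",
    "policy_requestor_type": "charm",
}

# Same policy with the extra stanza added by hand on the second unit.
patched_policy = {
    **original_policy,
    "blocked_actions": {
        **original_policy["blocked_actions"],
        "ovs-record-hostname": list(BLOCKED),
    },
}


def is_blocked(policy, service, action):
    """Hypothetical check: is this action on this service blocked?"""
    suffix = ".service"
    # Assumption: unit names are compared without a trailing ".service".
    name = service[: -len(suffix)] if service.endswith(suffix) else service
    return action in policy.get("blocked_actions", {}).get(name, [])


for label, policy in (("original", original_policy), ("patched", patched_policy)):
    verdict = is_blocked(policy, "ovs-record-hostname.service", "restart")
    print(label, "blocks ovs-record-hostname restart:", verdict)
```

With the original policy the ovs-record-hostname restart is permitted, while openvswitch-switch itself is blocked in both cases, which is consistent with the two log excerpts above.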

Billy Olsen (billy-olsen) wrote :

Added the neutron-gateway charm as well. neutron-openvswitch was triaged by reproducing the issue, and neutron-gateway by code inspection.

The new ovs-record-hostname service is not registered as a deferrable service in either of the charms. Refer to the following lines:

Neutron OpenvSwitch: https://opendev.org/openstack/charm-neutron-openvswitch/src/commit/9951beeff2f5df55ce84f1aca4e2038eff990b39/hooks/neutron_ovs_utils.py#L522
Neutron Gateway: https://opendev.org/openstack/charm-neutron-gateway/src/commit/cdc744ee2107483c3530a3617933da1c0ba85bcf/hooks/neutron_utils.py#L839
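
To illustrate why a missing entry in the deferrable service list matters, here is a minimal sketch of how a policy like the one shown earlier could be derived from such a list. `build_policy` and `DEFERRABLE_SERVICES` are hypothetical names for illustration only; this is not the code at the links above.

```python
BLOCKED_ACTIONS = ["restart", "stop", "try-restart"]

# Service names taken from the policy file shown earlier in this bug.
DEFERRABLE_SERVICES = [
    "neutron-dhcp-agent",
    "neutron-metadata-agent",
    "neutron-openvswitch-agent",
    "openvswitch-switch",
    "ovs-vswitchd",
    "ovs-vswitchd-dpdk",
    "ovsdb-server",
]


def build_policy(services, requestor_name="neutron-openvswitch"):
    """Hypothetical helper: map each deferrable service to its blocked actions."""
    return {
        "blocked_actions": {svc: list(BLOCKED_ACTIONS) for svc in sorted(services)},
        "policy_requestor_name": requestor_name,
        "policy_requestor_type": "charm",
    }


before = build_policy(DEFERRABLE_SERVICES)
after = build_policy(DEFERRABLE_SERVICES + ["ovs-record-hostname"])

# Only services present in the list receive a blocked_actions entry.
print("ovs-record-hostname" in before["blocked_actions"])
print("ovs-record-hostname" in after["blocked_actions"])
```

Under this model, a service that is absent from the list never gets a blocked_actions entry, so its restart requests are permitted during package upgrades.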

Changed in charm-neutron-gateway:
status: New → Triaged
importance: Undecided → Critical
Liam Young (gnuoy)
Changed in charm-neutron-gateway:
assignee: nobody → Liam Young (gnuoy)
Changed in charm-neutron-openvswitch:
assignee: nobody → Liam Young (gnuoy)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-openvswitch (master)
Changed in charm-neutron-openvswitch:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-gateway (master)
Changed in charm-neutron-gateway:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-gateway (master)

Reviewed: https://review.opendev.org/c/openstack/charm-neutron-gateway/+/825302
Committed: https://opendev.org/openstack/charm-neutron-gateway/commit/e72659126ddb8923124f3e9157efc50221529d5f
Submitter: "Zuul (22348)"
Branch: master

commit e72659126ddb8923124f3e9157efc50221529d5f
Author: Billy Olsen <email address hidden>
Date: Wed Jan 19 04:59:35 2022 -0700

    Add ovs-record-hostname to deferable service list

    The ovs-record-hostname service was introduced in the openvswitch SRU
    for bug #1915829, however this service did not make the deferable
    services list for the neutron-gateway charm. This causes package
    upgrades to restart the openvswitch-switch service. Add the
    ovs-record-hostname to the deferable services list in order to prevent
    unintended restarts of openvswitch-switch.

    Closes-Bug: #1955498
    Change-Id: I24a32f6f5a5c51b8b8ee62f88a973a126106fcd9

Changed in charm-neutron-gateway:
status: In Progress → Fix Committed
Felipe Reyes (freyes) wrote :
Changed in charm-neutron-gateway:
milestone: none → 22.04
Changed in charm-neutron-openvswitch:
milestone: none → 22.04
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-neutron-openvswitch (master)

Change abandoned by "Billy Olsen <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/825301
Reason: Handled in other patch set

Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released
Changed in charm-neutron-gateway:
status: Fix Committed → Fix Released