nova-compute daemon failing to vif_plug on Bionic/Queens after do-release-upgrade from Xenial

Bug #1928238 reported by Drew Freiberger
This bug affects 2 people
Affects                         Status   Importance   Assigned to   Milestone
OpenStack Nova Compute Charm    New      Undecided    Unassigned
systemd                         New      Undecided    Unassigned

Bug Description

We are finding that, upon Ubuntu series upgrade from xenial to bionic on Queens nova-compute (kvm/qemu hypervisor) units running the 21.04 OpenStack charms, the nova-compute service does not wait for OVS to be online before starting.

Looking at the systemd dependency tree, it appears that nova-compute.service's "After=[...] neutron-ovs-cleanup.service" ordering is not taking effect, as there is no corresponding "Wants="/"Requires=" entry for the same unit.

ubuntu@comp001:/etc/rc3.d$ systemctl list-dependencies nova-compute.service --reverse
nova-compute.service
● └─multi-user.target
●   └─graphical.target

ubuntu@comp001:/etc/rc3.d$ systemctl list-dependencies neutron-ovs-cleanup.service --reverse
neutron-ovs-cleanup.service
● ├─neutron-openvswitch-agent.service
● └─multi-user.target
●   └─graphical.target

ubuntu@comp001:~$ cat /etc/systemd/system/multi-user.target.wants/nova-compute.service
[Unit]
Description=OpenStack Compute
After=libvirtd.service postgresql.service mysql.service keystone.service rabbitmq-server.service ntp.service neutron-ovs-cleanup.service
*snip*

ubuntu@comp001:~$ cat /etc/systemd/system/multi-user.target.wants/neutron-ovs-cleanup.service
[Unit]
Description=OpenStack Neutron OVS cleanup
After=openvswitch-switch.service

ubuntu@comp001:~$ cat /etc/systemd/system/multi-user.target.wants/openvswitch-switch.service
[Unit]
Description=Open vSwitch
Before=network.target
After=network-pre.target ovsdb-server.service ovs-vswitchd.service
PartOf=network.target
Requires=ovsdb-server.service
Requires=ovs-vswitchd.service
*snip*
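
As the unit files show, nova-compute.service lists neutron-ovs-cleanup.service only under After=. A drop-in along the following lines would add the missing pull-in; this is a hypothetical workaround sketch, not something the charm currently ships:

# /etc/systemd/system/nova-compute.service.d/ovs-ordering.conf (hypothetical drop-in)
[Unit]
# After= only orders units that are already queued for activation;
# Wants= is what actually pulls neutron-ovs-cleanup.service into the
# boot transaction, so the two directives are needed together.
Wants=neutron-ovs-cleanup.service

followed by 'sudo systemctl daemon-reload'.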

ubuntu@comp001:~$ journalctl -u ovsdb-server -S 2021-05-12
*snip*
May 12 14:33:53 comp001 systemd[1]: Started Open vSwitch Database Unit.
*snip*

ubuntu@comp001:~$ journalctl -u nova-compute -S 2021-05-12
*snip*
-- Reboot --
May 12 14:33:16 comp001 systemd[1]: Started OpenStack Compute.
May 12 14:33:20 comp001 sudo[145958]: nova : TTY=unknown ; PWD=/var/lib/nova ; USER=root ; COMMAND=/usr/bin/nova-rootwrap /etc/nova/rootwrap.conf privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpneYYST/privsep.sock
May 12 14:33:20 comp001 sudo[145958]: pam_unix(sudo:session): session opened for user root by (uid=0)
May 12 14:33:21 comp001 sudo[145958]: pam_unix(sudo:session): session closed for user root
May 12 14:33:22 comp001 ovs-vsctl[146707]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-int -- set Bridge br-int datapath_type=system
May 12 14:33:22 comp001 ovs-vsctl[146707]: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

*nova-compute crashed, was restarted by the operator*

May 12 14:37:01 comp001 systemd[1]: Started OpenStack Compute.
May 12 14:37:04 comp001 sudo[224716]: nova : TTY=unknown ; PWD=/var/lib/nova ; USER=root ; COMMAND=/usr/bin/nova-rootwrap /etc/nova/rootwrap.conf privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpSmCq9d/privsep.sock
May 12 14:37:04 comp001 sudo[224716]: pam_unix(sudo:session): session opened for user root by (uid=0)
May 12 14:37:04 comp001 sudo[224716]: pam_unix(sudo:session): session closed for user root
May 12 14:37:05 comp001 ovs-vsctl[224795]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=120 -- --if-exists del-port br-int qvoacd8b7cf-e2
*snip*
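
To make the race easier to see, the start timestamps of the two units can be compared on one timeline (standard journalctl options, nothing bug-specific; run against the affected boot):

ubuntu@comp001:~$ journalctl -b -o short-monotonic -u nova-compute -u ovsdb-server | grep -E 'Started (OpenStack Compute|Open vSwitch Database)'

On the affected boot above, nova-compute starts at 14:33:16, some 37 seconds before ovsdb-server comes up at 14:33:53, which matches the 'database connection failed' error from ovs-vsctl.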

Revision history for this message
Drew Freiberger (afreiberger) wrote :

This almost seems like a systemd bug, since "After=" should imply "Wants=" without it needing to be specified (as you can see in the 'systemctl list-dependencies neutron-ovs-cleanup.service --reverse' output above, After= is honored for that service as a prerequisite).
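
Whether any pull-in edge actually exists can be checked by dumping the relevant unit properties (systemctl show is standard systemd tooling; the property names are as systemd reports them):

ubuntu@comp001:~$ systemctl show nova-compute.service -p Wants -p Requires -p After
ubuntu@comp001:~$ systemctl show neutron-openvswitch-agent.service -p Wants -p Requires

If neutron-ovs-cleanup.service appears only in After= for nova-compute, systemd will order the two units when both happen to be queued, but will never start neutron-ovs-cleanup on nova-compute's behalf.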

Drew Freiberger (afreiberger)
tags: added: cold-boot
tags: added: series-upgrade
Revision history for this message
Drew Freiberger (afreiberger) wrote :

This may be correlated with the systemd units that the series-upgrade prepare phase leaves masked:

# systemctl status nova-compute.service openvswitch-switch.service
● nova-compute.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)

● openvswitch-switch.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)

These are the states of the services upon boot; when we then run 'juju upgrade-series ... complete', the charms unmask and enable them, but the startup ordering may not be recalculated properly.
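
For context, the prepare and complete phases effectively do something like the following to these units (a sketch; the charms' actual helper code may differ):

# during 'juju upgrade-series <machine> prepare', each charm masks its own
# services (sketch):
systemctl mask nova-compute.service        # by the nova-compute charm
systemctl mask openvswitch-switch.service  # by the neutron-openvswitch charm

# ... do-release-upgrade and reboot ...

# during 'complete', each charm's post-series-upgrade hook unmasks and starts
# its own services again, in whatever order juju fires the hooks:
systemctl unmask nova-compute.service && systemctl start nova-compute.service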

Revision history for this message
Drew Freiberger (afreiberger) wrote :

It seems that nova-compute is started by the post-series-upgrade hooks while the openvswitch-switch service is still masked.
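
That is reproducible by hand: starting nova-compute while openvswitch-switch is masked succeeds, because After= alone neither starts nor waits for an inactive dependency (illustrative commands only):

ubuntu@comp001:~$ systemctl is-enabled openvswitch-switch.service
masked
ubuntu@comp001:~$ sudo systemctl start nova-compute.service
(starts immediately; OVS stays down and the vif_plug calls fail as above)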

Revision history for this message
Drew Freiberger (afreiberger) wrote :

$ juju upgrade-series 78 complete
machine-78 complete phase started
machine-78 started unit agents after series upgrade
ntp/124 post-series-upgrade hook running
ntp/124 post-series-upgrade completed
ceph-osd/18 post-series-upgrade hook running
ceph-osd/18 post-series-upgrade completed
filebeat/853 post-series-upgrade hook running
filebeat/853 post-series-upgrade completed
nova-compute-kvm/16 post-series-upgrade hook running
nova-compute-kvm/16 post-series-upgrade completed
hw-health/7 post-series-upgrade hook running
hw-health/7 post-series-upgrade completed
telegraf/859 post-series-upgrade hook running
telegraf/859 post-series-upgrade completed
ceilometer-agent/61 post-series-upgrade hook running
ceilometer-agent/61 post-series-upgrade completed
neutron-openvswitch/59 post-series-upgrade hook running

As you can see, the neutron-openvswitch post-series-upgrade hook runs after nova-compute-kvm finishes. This is likely a juju bug: juju should perhaps run post-series-upgrade hooks for subordinate charms before their principal charms.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

This issue does appear to be randomized by the ordering of the 'juju upgrade-series complete' hooks. In the run below, neutron-openvswitch completed its post-series-upgrade hook before nova-compute, and the machine was not affected by this bug. There must be something non-deterministic about the order in which juju fires the post-series-upgrade hooks that causes this race in some instances and not others. Separately, I wonder whether neutron-openvswitch should actually mask openvswitch-switch.service during series upgrade at all, or instead allow OVS to continue running across the upgrade and reboot, since nova may not be the only network path consuming the OVS service.

juju upgrade-series 1 complete
machine-1 complete phase started
machine-1 started unit agents after series upgrade
ntp/19 post-series-upgrade hook running
ntp/19 post-series-upgrade completed
neutron-openvswitch/1 post-series-upgrade hook running
neutron-openvswitch/1 post-series-upgrade completed
nrpe-host/17 post-series-upgrade hook running
nrpe-host/17 post-series-upgrade hook not found, skipping
lldpd/6 post-series-upgrade hook running
lldpd/6 post-series-upgrade completed
ceph-osd/1 post-series-upgrade hook running
ceph-osd/1 post-series-upgrade completed
filebeat/16 post-series-upgrade hook running
filebeat/16 post-series-upgrade completed
ceilometer-agent/1 post-series-upgrade hook running
ceilometer-agent/1 post-series-upgrade completed
hw-health/11 post-series-upgrade hook running
hw-health/11 post-series-upgrade completed
nova-compute-kvm/1 post-series-upgrade hook running
nova-compute-kvm/1 post-series-upgrade completed
landscape-client/19 post-series-upgrade hook running
landscape-client/19 post-series-upgrade hook not found, skipping
canonical-livepatch/6 post-series-upgrade hook running
canonical-livepatch/6 post-series-upgrade completed
telegraf/19 post-series-upgrade hook running
telegraf/19 post-series-upgrade completed
machine-1 series upgrade complete

Revision history for this message
Drew Freiberger (afreiberger) wrote :

This is actually two bugs. One is the 'upgrade-series complete' ordering issue, which results in nova-compute completing its post-series-upgrade hook and starting before the neutron-openvswitch charm has completed; this is a nova-compute charm bug.

The other issue is that systemd startup ordering does not properly reflect the dependency when only the "After=" keyword is used.
