openvswitch systemd unit file ordering wrong

Bug #1448254 reported by Kevin Otte on 2015-04-24
This bug affects 15 people
Affects                  Importance  Assigned to
One Hundred Papercuts    Critical    Unassigned
openvswitch (Ubuntu)     High        James Page
  Xenial                 High        Unassigned
  Yakkety                High        Unassigned
  Zesty                  High        James Page
Declined for Vivid by James Page

Bug Description

[Impact]
Systems whose primary network devices are managed by openvswitch take a long time to boot.

[Test Case]
Configure the server's primary network interface using the openvswitch /etc/network/interfaces (eni) syntax, then reboot the server (it will take about 4 minutes to boot).

[Regression Potential]
Medium risk - this is a change to the behaviour of the systemd units, but it does appear to have been validated by the wider community.

[Original Bug Report]
After upgrade to vivid, my system takes nearly 4 minutes to boot. This appears to be related to the new systemd unit ordering.

[Unit]
Description=Open vSwitch
After=network.target openvswitch-nonetwork.service
...

root@mystic:/lib/systemd/system# systemd-analyze blame | head
      2min 233ms ifup-wait-all-auto.service
...

Open vSwitch is being started after the network, but the network needs Open vSwitch to start since my host traffic is flowing through the bridge:

root@mystic:/lib/systemd/system# ovs-vsctl show
838a8aa4-4811-447d-8dcc-dbb675b78968
    Bridge "br0"
        Port "br0"
            tag: 1
            Interface "br0"
                type: internal
        Port "vlan121"
            tag: 121
            Interface "vlan121"
                type: internal
        Port "eth0"
            tag: 1
            Interface "eth0"
    ovs_version: "2.3.1"

The interfaces do eventually start correctly, but only after the long timeout above.
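
The ordering described above can be read straight out of the unit file. What follows is only a minimal sketch, not part of the original report: the helper names `unit_after` and `starts_after_network` are made up for illustration, and the sketch looks only at literal `After=` lines in a single file (no drop-ins, no implicit dependencies).

```shell
#!/bin/sh
# Print every unit named on the After= lines of a unit file, one per line.
# Hypothetical helper: reads only the literal text of the given file.
unit_after() {
    sed -n 's/^After=//p' "$1" | tr -s ' ' '\n' | sed '/^$/d'
}

# Succeed if the unit file orders itself after network.target.
starts_after_network() {
    unit_after "$1" | grep -qxF 'network.target'
}

# Example: report on the packaged unit, if it is present on this machine.
unit=/lib/systemd/system/openvswitch-switch.service
if [ -r "$unit" ] && starts_after_network "$unit"; then
    echo "$unit is ordered after network.target"
fi
```

If the check succeeds on a host whose own traffic flows through an OVS bridge, the unit is waiting for a network that cannot come up without it.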

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: openvswitch-switch 2.3.1-0ubuntu1
ProcVersionSignature: Ubuntu 3.19.0-15.15-generic 3.19.3
Uname: Linux 3.19.0-15-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.17.2-0ubuntu1
Architecture: amd64
Date: Fri Apr 24 14:10:19 2015
EcryptfsInUse: Yes
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: openvswitch
UpgradeStatus: Upgraded to vivid on 2015-04-24 (0 days ago)

Kevin Otte (nivex) wrote :

Looks like we've been through this before with upstart:
https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1084028

Mark Dunn (mark-dunn-y) wrote :

I have the same problem.
In systemd, there are two files, openvswitch-nonetwork.service and openvswitch-switch.service, which are supposed to fix this problem.

I am testing OpenStack with VXLAN and require:

sudo ovs-vsctl add-br br-eth2
sudo ovs-vsctl set port br-eth2 tag=2001
sudo ovs-vsctl add-port br-eth2 vxlan1
sudo ovs-vsctl set interface vxlan1 type=vxlan options:remote_ip=192.168.102.205
sudo ovs-vsctl add-port br-eth2 vxlan2
sudo ovs-vsctl set interface vxlan2 type=vxlan options:remote_ip=192.168.102.234
sudo ovs-vsctl add-br br-ex
sudo ovs-vsctl add-port br-ex eth0

I then bring up the network from /etc/network/interfaces with:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
    up ifconfig $IFACE 0.0.0.0 up
    up ip link set $IFACE promisc on
    down ip link set $IFACE promisc off
    down ifconfig $IFACE down

auto br-ex
iface br-ex inet static
    address 192.168.102.206
    netmask 255.255.255.0
    gateway 192.168.102.36
    dns-nameservers 192.168.102.10 192.168.102.50
    up ip link set $IFACE promisc on
    down ip link set $IFACE promisc off

auto br-eth2
iface br-eth2 inet static
    address 10.1.0.11
    netmask 255.255.255.0
    up ip link set $IFACE promisc on
    down ip link set $IFACE promisc off
    mtu 1446

From "sudo systemctl list-unit-files" I get:

openvswitch-nonetwork.service static
openvswitch-switch.service enabled

so I assume they are running

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu):
status: New → Confirmed
Changed in openvswitch (Ubuntu):
importance: Undecided → Critical
Changed in hundredpapercuts:
status: New → Confirmed
importance: Undecided → Critical
Changed in openvswitch (Ubuntu):
status: Confirmed → Triaged
Changed in hundredpapercuts:
status: Confirmed → Triaged
Mark Dunn (mark-dunn-y) wrote :

Sorry, lost track of the bug as it fell into 100 papercuts...

If it helps, I solved my ordering problem by modifying
    /lib/systemd/system/openvswitch-nonetwork.service
as follows:

    [Unit]
    Description=Open vSwitch Internal Unit
    PartOf=openvswitch-switch.service

    # Without this all sorts of looping dependencies occur doh!
    DefaultDependencies=no

    # prerequisites pulled from the ifup@ service requirements
    After=apparmor.service local-fs.target systemd-tmpfiles-setup.service

    #subsequent to this service we need the network to start
    Wants=network-pre.target openvswitch-switch.service
    Before=network-pre.target openvswitch-switch.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    EnvironmentFile=-/etc/default/openvswitch-switch
    ExecStart=/usr/share/openvswitch/scripts/ovs-ctl start \
          --system-id=random $OPTIONS
    ExecStop=/usr/share/openvswitch/scripts/ovs-ctl stop

This pulled up the services and allowed my configuration to work

openvswitch-nonetwork.service
● ├─openvswitch-switch.service
● └─network-pre.target
●   ├─<email address hidden>
●   ├─<email address hidden>
●   ├─<email address hidden>
●   ├─<email address hidden>
●   ├─networking.service
●   └─network.target
●     ├─mysql.service
●     ├─openvswitch-switch.service
●     ├─rabbitmq-server.service
●     ├─rc-local.service
●     ├─ssh.service
●     └─network-online.target
●       ├─apache2.service
●       ├─dns-clean.service
●       └─kerneloops.service

(the dots are green :) )

There is lots of noise on the net about chickens and eggs, so I do not know if it solves some other case.
The thing that took me so long to solve it (besides my unfamiliarity with systemd) was the AUTOMATIC inclusion of dependencies. Who on earth dreamt that up?
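
For anyone applying the modified unit by hand, the ordering it relies on can be sanity-checked from the file text. This is a rough sketch under stated assumptions: the helper names `has_dep` and `starts_before_network` are invented here, and the choice of "three settings that matter" (`DefaultDependencies=no`, plus `network-pre.target` in both `Before=` and `Wants=`) is read off the unit quoted above rather than from any official checklist.

```shell
#!/bin/sh
# has_dep FILE KEY UNIT: succeed if a "KEY=" line in FILE lists UNIT.
# Hypothetical helper; leading whitespace is tolerated so that indented
# copies of a unit (as pasted in this thread) can be checked too.
has_dep() {
    sed -n "s/^[[:space:]]*$2=//p" "$1" | tr -s ' ' '\n' | grep -qxF "$3"
}

# Succeed if FILE disables default dependencies and is both ordered
# before, and pulls in, network-pre.target.
starts_before_network() {
    grep -Eq '^[[:space:]]*DefaultDependencies=no[[:space:]]*$' "$1" &&
        has_dep "$1" Before network-pre.target &&
        has_dep "$1" Wants network-pre.target
}

# Example: report on the installed unit, if it is present on this machine.
unit=/lib/systemd/system/openvswitch-nonetwork.service
if [ -r "$unit" ]; then
    if starts_before_network "$unit"; then
        echo "$unit is ordered before network-pre.target"
    else
        echo "$unit does not carry the ordering from the modified unit"
    fi
fi
```

The check looks at one file only, so it says nothing about drop-ins or about what systemd actually computed at boot.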

Brian Turek (brian-turek) wrote :

I was one of the people that helped get this fixed in Trusty and now, in a major case of deja vu, I have hit the same issue with Xenial. It looks like Mark Dunn's fix above matches http://www.opencloudblog.com/?p=240 and I can confirm it fixed the problem on Xenial.

tags: added: xenial
James Page (james-page) on 2016-11-15
Changed in openvswitch (Ubuntu Yakkety):
status: New → Triaged
Changed in openvswitch (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in openvswitch (Ubuntu Yakkety):
importance: Undecided → High
Changed in openvswitch (Ubuntu Zesty):
importance: Critical → High
status: Triaged → Fix Committed
assignee: nobody → James Page (james-page)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.6.1-0ubuntu1

---------------
openvswitch (2.6.1-0ubuntu1) zesty; urgency=medium

  * New upstream point release (LP: #1641956).
  * d/openvswitch-switch.openvswitch-nonetwork.service: Update Unit
    definition to ensure that openvswitch starts prior to configuration
    of any network interfaces (LP: #1448254). Thanks to Mark Dunn for
    this fix.

 -- James Page <email address hidden> Tue, 15 Nov 2016 14:15:06 +0000

Changed in openvswitch (Ubuntu Zesty):
status: Fix Committed → Fix Released
Simon Leinen (simon-leinen) wrote :

Thanks for looking into this. Any chance of getting the fix backported to Xenial?

James Page (james-page) wrote :

@Simon

Working on the backports

Changed in openvswitch (Ubuntu Yakkety):
assignee: nobody → James Page (james-page)
status: Triaged → In Progress
James Page (james-page) on 2017-01-16
description: updated
James Page (james-page) wrote :

OK so I'm struggling a bit to produce a nice /e/n/i that reproduces the issue on a minimal/fresh install to help with SRU verification - any of the participants in this bug report have anything we can use for that?

James Page (james-page) on 2017-04-04
Changed in openvswitch (Ubuntu Yakkety):
assignee: James Page (james-page) → nobody
status: In Progress → Triaged
Brian Turek (brian-turek) wrote :

James,

This bug seems to be weird to reproduce. My physical Xenial server worked for months without the proposed patch, then last week it suddenly stopped receiving a DHCP lease 2 days ago. After much troubleshooting I remembered I was on this bug report, used the proposed patch, and it works like a charm. Without it my server would never try to get a DHCP lease on startup, but ifup worked fine once the machine was up.

I tried to recreate using a VM but the VM has no such issue; it grabs a DHCP lease without any problems. I don't know if it's some weird race condition or what.

Nicolas SCHWARTZ (aurryon) wrote :

The bug is still there on Ubuntu 16.04.
In the journalctl logs, openvswitch complains about connecting to its db while ifup tries to raise the NICs defined in /etc/network/interfaces or /etc/network/interfaces.d/*, which slows down the process and creates errors.

Steps to reproduce:
-Create OVS interface in the /etc/network/interfaces
E.g.:
allow-ovs main_sw
auto main_sw
iface main_sw inet manual
        ovs_type OVSBridge
-Reboot your computer
-Look at journalctl -b0 -r.

The new patched systemd unit nonetwork for zesty works well on xenial.
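
The "database connection failed" errors come down to the ovsdb-server socket not existing yet when ovs-vsctl is called. As a diagnostic only (not a fix), something like the sketch below can show whether the socket is there at a given moment. The function name `wait_for_socket` is made up for this example; the socket path is the one that appears in the ovs-vsctl errors in this report.

```shell
#!/bin/sh
# wait_for_socket PATH SECONDS: poll once per second until PATH exists as
# a Unix socket; fail if it has not appeared after SECONDS seconds.
# Hypothetical helper, meant for poking at the race by hand.
wait_for_socket() {
    remaining=$2
    while [ ! -S "$1" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example: check right now, without waiting.
sock=/var/run/openvswitch/db.sock
if wait_for_socket "$sock" 0; then
    echo "ovsdb-server socket is present"
else
    echo "ovsdb-server socket is missing: ovs-vsctl calls would fail"
fi
```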

I'm also affected by this on Ubuntu 16.04. The patched systemd unit-file does not work for me.

With the original unit-file the system starts fine and all bridges and physical interfaces are configured by openvswitch, but the system hangs during shutdown.

With the patched unit-file the physical interfaces do not get attached to the openvswitch bridges, but system shutdown works fine. It looks like ifup@ is bringing the interfaces up before openvswitch consumes them.

I tried playing with the unit dependencies, but so far without luck.

Is there already a bug report for Xenial regarding this issue? I think it would be a good idea to open a bug against Xenial, don't you think?

Sorry, I was wrong. I had some issues in my /etc/network/interfaces file which caused the patched unit-file to malfunction. To be more precise: I had put an "auto <ifname>" line above every device that was later enslaved by an openvswitch bridge or bond. That forced ifup@ to bring the devices up before they were enslaved by OVS...

 The patched unit-file is working fine now.
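
A mistake like this can be spotted mechanically. The sketch below is an illustration only: the function name `list_auto_ovs_ports` is invented, and it assumes that enslaved devices are declared with the `ovs_bridge`, `ovs_ports` or `ovs_bonds` options of the openvswitch ifupdown integration, which this thread does not show in full. It prints every interface that is both marked `auto` and enslaved to OVS, so each hit can be reviewed by hand.

```shell
#!/bin/sh
# list_auto_ovs_ports FILE: print interfaces that have an "auto" line and
# are also enslaved to OVS, either through an ovs_bridge option in their
# own stanza or by being named on an ovs_ports / ovs_bonds line.
# Hypothetical helper; it does not follow "source" directives.
list_auto_ovs_ports() {
    awk '
        $1 == "auto"  { for (i = NF; i >= 2; i--) autos[$i] = 1 }
        $1 == "iface" { current = $2 }
        $1 == "ovs_bridge" && current != "" { enslaved[current] = 1 }
        $1 == "ovs_ports" || $1 == "ovs_bonds" {
            for (i = NF; i >= 2; i--) enslaved[$i] = 1
        }
        END { for (name in enslaved) if (name in autos) print name }
    ' "$1" | sort
}

# Example: check the main interfaces file, if it is readable.
if [ -r /etc/network/interfaces ]; then
    list_auto_ovs_ports /etc/network/interfaces
fi
```

An OVS bridge itself can still carry an `auto` line (as in the main_sw example earlier in this report); only the devices attached to it are flagged.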

Elias Abacioglu (raboo) wrote :

I can also confirm that Mark's patch worked.
This was reported two years ago; why hasn't it been fixed yet?

Elias Abacioglu (raboo) wrote :

I might as well just add some facts and my logs.

First of all, I have one Xenial node where Open vSwitch worked; I'm guessing that on that boot the Open vSwitch daemon happened to start before the network was set up.
On other machines it doesn't work, because the daemon isn't started before networking is set up.

Here are the logs from a broken system.
```
Sep 12 16:11:07 vm10 systemd[1]: Reached target Network.
Sep 12 16:11:07 vm10 systemd[1]: networking.service: Failed with result 'exit-code'.
Sep 12 16:11:07 vm10 systemd[1]: networking.service: Unit entered failed state.
Sep 12 16:11:07 vm10 systemd[1]: Failed to start Raise network interfaces.
Sep 12 16:11:07 vm10 systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 16:11:07 vm10 ntpdate[3027]: no servers can be used, exiting
Sep 12 16:11:07 vm10 ifup[2858]: Failed to bring up br0.
Sep 12 16:11:07 vm10 ifup[2858]: Cannot find device "br0"
Sep 12 16:11:07 vm10 ifup[2858]: Failed to bring up mgm.
Sep 12 16:11:07 vm10 ifup[2858]: Cannot find device "mgm"
Sep 12 16:11:07 vm10 ifup[2858]: mgm: ERROR while getting interface flags: No such device
Sep 12 16:11:07 vm10 ifup[2858]: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:07 vm10 ovs-vsctl[2991]: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:07 vm10 ovs-vsctl[2991]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --may-exist add-port br0 mgm vlan_mode=access -- set Interface mgm type=internal -- set interface mgm external-ids:iface-id=vm10-mgm-vif
Sep 12 16:11:07 vm10 ifup[2858]: Failed to bring up bond0.
Sep 12 16:11:07 vm10 ifup[2858]: Cannot find device "bond0"
Sep 12 16:11:07 vm10 kernel: bnx2x 0000:01:00.1 em2: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
Sep 12 16:11:06 vm10 kernel: bnx2x 0000:01:00.1 em2: using MSI-X IRQs: sp 50 fp[0] 52 ... fp[7] 59
Sep 12 16:11:06 vm10 kernel: bnx2x 0000:01:00.0 em1: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
Sep 12 16:11:06 vm10 kernel: bnx2x 0000:01:00.0 em1: using MSI-X IRQs: sp 39 fp[0] 41 ... fp[7] 48
Sep 12 16:11:05 vm10 ifup[2858]: bond0: ERROR while getting interface flags: No such device
Sep 12 16:11:05 vm10 ifup[2858]: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:05 vm10 ovs-vsctl[2922]: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:05 vm10 ovs-vsctl[2922]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --fake-iface add-bond br0 bond0 em1 em2 vlan_mode=native-untagged bond_mode=balance-tcp lacp=active other_config:lacp-time=fast --
Sep 12 16:11:05 vm10 ifup[2858]: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:05 vm10 ovs-vsctl[2908]: ovs|00002|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Sep 12 16:11:05 vm10 ovs-vsctl[2908]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --may-ex...

```
