Daily cron restarts network on unattended updates but keepalived .service is not restarted as a dependency

Bug #1810583 reported by Tom Scholten on 2019-01-05
This bug affects 11 people
Affects                       Status   Importance  Assigned to
keepalived (Ubuntu)           Triaged  High        Karl Stenerud
networkd-dispatcher (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

[Impact]

If systemd-networkd is restarted, any VRRP virtual IP addresses managed by keepalived are lost and not restored.

[Test Case]

multipass launch daily:bionic --name tester && multipass exec tester -- sudo su

apt update && apt dist-upgrade -y && apt install -y keepalived &&
echo "vrrp_instance VI_1 {
    virtual_router_id 33
    state MASTER
    interface ens3

    virtual_ipaddress {
        $(ip addr | grep 'inet ' | grep global | head -1 | sed 's/.*inet \([0-9]*\.[0-9]*\.[0-9]*\)\..*/\1.3/g')
    }
}" >/etc/keepalived/keepalived.conf &&
service keepalived start &&

# There will be a new IP address x.x.x.3/32 added to ens3
ip addr

# Restart networkd. The IP address won't come back
systemctl restart systemd-networkd
ip addr

# Restart keepalived. The IP address will come back
systemctl restart keepalived
ip addr

[Regression Potential]

TODO

[Original Description]

Description: Ubuntu 18.04.1 LTS
Release: 18.04
ii keepalived 1:1.3.9-1ubuntu0.18.04.1 amd64 Failover and monitoring daemon for LVS clusters

(From unanswered https://answers.launchpad.net/ubuntu/+source/keepalived/+question/676267)

For the past two weeks we have been losing our keepalived VRRP address on one of our systems. Closer inspection reveals that this was due to the daily cron job. Apparently something triggered a udev reload (and the same seemed to happen last week), which obviously triggers a network restart.

Are we right in assuming the patch below is the correct way to fix this (and shouldn't this be in the default install of the systemd service of keepalived)?

/etc/systemd/system/multi-user.target.wants/keepalived.service:
--- keepalived.service.orig 2018-11-20 09:17:06.973924706 +0100
+++ keepalived.service 2018-11-20 09:05:55.984773226 +0100
@@ -4,6 +4,7 @@
 Wants=network-online.target
 # Only start if there is a configuration file
 ConditionFileNotEmpty=/etc/keepalived/keepalived.conf
+PartOf=systemd-networkd.service

Accompanying syslog:
Nov 20 06:34:33 ourmachine systemd[1]: Starting Daily apt upgrade and clean activities...
Nov 20 06:34:42 ourmachine systemd[1]: Reloading.
Nov 20 06:34:44 ourmachine systemd[1]: message repeated 2 times: [ Reloading.]
Nov 20 06:34:44 ourmachine systemd[1]: Starting Daily apt download activities...
Nov 20 06:34:44 ourmachine systemd[1]: Stopping udev Kernel Device Manager...
Nov 20 06:34:44 ourmachine systemd[1]: Stopped udev Kernel Device Manager.
Nov 20 06:34:44 ourmachine systemd[1]: Starting udev Kernel Device Manager...
Nov 20 06:34:44 ourmachine systemd[1]: Started udev Kernel Device Manager.
Nov 20 06:34:45 ourmachine systemd[1]: Reloading.
Nov 20 06:34:45 ourmachine systemd[1]: Reloading.
Nov 20 06:35:13 ourmachine systemd[1]: Reexecuting.
Nov 20 06:35:13 ourmachine systemd[1]: Stopped Wait for Network to be Configured.
Nov 20 06:35:13 ourmachine systemd[1]: Stopping Wait for Network to be Configured...
Nov 20 06:35:13 ourmachine systemd[1]: Stopping Network Service..

Karl Stenerud (kstenerud) wrote :

Hi Tom, thanks for bringing up this issue!

As this package needs to work with both server and desktop editions, I'm not sure how this would work with NetworkManager...

Would you be able to put together a simple VM test case that demonstrates the issue and fix, and ensures things still work as a whole?

Changed in keepalived (Ubuntu):
status: New → Confirmed

Ben Hollins (bhollins) wrote :

Hi Karl.
I can confirm this issue as well. We encountered it this morning on a two-node keepalived cluster consisting of two VMware Ubuntu 18.04.1 VMs. In our case, a daily update task had restarted udev, which in turn restarted systemd-networkd. When this service restarted, the virtual IP on the MASTER node's NIC was lost, but keepalived did not notice and the IP was never restored on either MASTER or BACKUP. This caused an outage of the services hosted on the virtual IP.

When we investigated, we found that both MASTER and BACKUP nodes only had their own primary IP addresses, and neither node had the virtual IP. The virtual IP was unreachable. No managed failover by keepalived had occurred.

We restarted keepalived on both nodes, which caused the virtual IP to reappear on the MASTER node's NIC. We can reproduce this on demand right now by manually restarting systemd-networkd, which causes the virtual IP to vanish. The only way to get it to return is to then manually restart keepalived.

Notably, when this problem occurs, keepalived logs nothing to syslog at all, which suggests it does not recognise the restart of networkd or the loss of the virtual IP, and therefore does not announce it to the BACKUP node.

There is a good discussion on the Ubuntu Forums about this, and someone has confirmed that downgrading the keepalived package to the previous version resolves this behaviour, so it does look like the patch in the latest package version may have introduced this.

Here is the thread for ref:
https://ubuntuforums.org/showthread.php?t=2406400&p=13819524#post13819524

I'm happy to test anything required on a VM if necessary. We haven't taken any action to work around this yet.

Tom Scholten (snowtom) wrote :

I don't have a desktop edition ready at the moment, but would be willing to pick that up if time allows. I concur with the findings of Ben, we seem to hit the same, although we 'patched' the systemd unit file.

Looking around a bit, I'm not sure what the best way would be to make it both systemd-networkd and NetworkManager proof. Looking at the documentation (https://www.freedesktop.org/software/systemd/man/systemd.unit.html#PartOf), it looks like PartOf is not actually a requirement, and as such both could be in there.
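
As a sketch of that idea, the PartOf= line could live in a drop-in instead of an edited copy of the unit file, so that a package upgrade does not overwrite it. The drop-in file name (10-partof-network.conf) and the helper function are made up for illustration, and listing NetworkManager.service next to systemd-networkd.service is an untested assumption based on the observation that PartOf= does not pull the named units in.

```shell
#!/bin/sh
# Hypothetical helper: write a keepalived drop-in into the directory given as $1.
# On a real system that directory would be /etc/systemd/system/keepalived.service.d
write_partof_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    # PartOf= propagates a stop or restart of the listed units to keepalived.
    # Listing both network managers is an assumption, not something tested in this thread.
    cat > "$dropin_dir/10-partof-network.conf" <<'EOF'
[Unit]
PartOf=systemd-networkd.service NetworkManager.service
EOF
}

# Example (as root), followed by a reload so systemd picks up the drop-in:
#   write_partof_dropin /etc/systemd/system/keepalived.service.d
#   systemctl daemon-reload
```

Whether this is the right fix for desktop installs is exactly the open question above, so treat it as a local workaround rather than a packaging recommendation.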

Ben Hollins (bhollins) wrote :

Just to add, we also attempted to work around this by adding a systemd override to netplan to recycle the keepalived service whenever network management was restarted. While it corrected the issue, it also created another problem whereby the system hung on startup after a reboot waiting endlessly for the network daemon to start. I had to revert this change in light of this.

For now, I've disabled the Ubuntu auto-update task completely; hopefully this will avoid any network service restarts until the issue is resolved within the package.

tags: added: rls-dd-incoming

Dimitri John Ledkov (xnox) wrote :

There are three cases:
- upgrades from xenial with ifupdown
- fresh installs with netplan/systemd-networkd
- fresh installs with network-manager

I think the right way to integrate this with networkd is to ship a networkd-dispatcher script to do the right thing w.r.t. keepalived

http://manpages.ubuntu.com/manpages/bionic/man8/networkd-dispatcher.8.html

Andreas Hasenack (ahasenack) wrote :

Do all of you have daily network restarts? What's the reason? Or was this a one-off update that just by chance had a package upgrade that required such a restart?

That being said, I of course agree that losing the virtual IP in such a situation is bad.

Julian Andres Klode (juliank) wrote :

I guess you want to systemctl reload keepalived on most state changes in networkd, but I'm not sure. Probably not on off and no-carrier, since no traffic is possible yet.

That said, I do wonder why you need to do this in the first place. keepalived really should listen to netlink and figure out interface status on its own.
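
A rough sketch of the networkd-dispatcher approach, with the state filtering suggested above. It assumes the script is installed as an executable hook (the path in the comment is hypothetical) and that networkd-dispatcher exports OperationalState to hooks as its man page describes. It uses try-restart rather than reload because only a restart was confirmed in this thread to bring the address back. Whether a networkd restart actually produces a state change that fires the hook has not been verified here.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /etc/networkd-dispatcher/routable.d/50-keepalived
# The same file could be linked into other state directories if needed.

# Decide whether an operational state should trigger a keepalived restart.
# "off" and "no-carrier" are skipped because no traffic is possible yet;
# an empty state means the script was not called by networkd-dispatcher.
should_restart_keepalived() {
    case "$1" in
        off|no-carrier|"") return 1 ;;
        *) return 0 ;;
    esac
}

if should_restart_keepalived "${OperationalState:-}"; then
    # try-restart leaves keepalived stopped if it was not running before
    systemctl try-restart keepalived.service
fi
```

Restarting keepalived on every state change may cause extra VRRP transitions on a busy host, so the list of skipped states is something to tune rather than a settled choice.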

Sebastien Bacher (seb128) wrote :

> something triggered a udev reload (and last week the same seemed to happen) which obviously triggers a network restart

Why would a udev reload trigger a network restart? Just as a random side note, snapd does interact with udev rules and can trigger a reload (or did in the past), so it's not impossible that it is the one triggering the event.

Ben Hollins (bhollins) wrote :

Andreas, in our case this was a one off. The system had been running for 2 months without any issues, and this sudden network restart due to a daily update check was not expected. We did a lot of testing different failover events (disconnecting vNIC, powering off a single node, stopping keepalived service etc), but we never specifically tested a restart of the networkd service. This bug has potentially gone unnoticed for some time because of this aspect, and the frequency of this event occurring (in our case), is low.

Just for visibility, the specific workaround I attempted, which recycled keepalived on network restart, was to add an override to the networkd unit file using the following commands. This fixes the immediate issue (keepalived restarts as desired), but prevents the network daemon from starting up after a reboot, leaving the system stuck in a wait loop. I had to boot into recovery mode and remove the override file again to restore functionality.

---------------------------------
sudo systemctl edit systemd-networkd

Then, in the override file (opened in nano):

[Service]
ExecStartPost=!/bin/systemctl restart keepalived
---------------------------------

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in networkd-dispatcher (Ubuntu):
status: New → Confirmed

Ben Hollins (bhollins) wrote :

We had this happen again this morning, causing an outage. Same issue, apt daily leads to a udev restart, which in turn restarted the network service and caused VRRP address to be lost on both haproxy nodes. I am going to try and completely disable the apt daily scheduled job while this bug remains.

Changed in networkd-dispatcher (Ubuntu):
status: Confirmed → Opinion
status: Opinion → Invalid
Robie Basak (racb) on 2019-02-23
tags: added: server-triage-discuss
Changed in keepalived (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
Robie Basak (racb) on 2019-02-27
tags: removed: server-triage-discuss
Changed in keepalived (Ubuntu):
assignee: nobody → Karl Stenerud (kstenerud)
description: updated
description: updated

Karl Stenerud (kstenerud) wrote :

There is a fix upstream for this issue in keepalived 2.0. I'm looking into what would be required to backport the fix. In the meantime, there is a workaround that I hope will be sufficient for your needs, as discovered by https://chr4.org/blog/2019/01/21/make-keepalived-play-nicely-with-netplan-slash-systemd-network/

You'll need to create a dummy interface, and then assign the virtual IP to that. Here's an example using a VM, which will generate a virtual ip of x.y.z.3. You can set your own last quad by changing the last part of the sed command '\1.3/g' to .4 or .215 or whatever:

multipass launch daily:bionic --name tester && multipass exec tester -- sudo su

Inside the VM:

apt update && apt dist-upgrade -y && apt install -y keepalived &&
echo "vrrp_instance VI_1 {
    virtual_router_id 33
    state MASTER
    interface ens3

    virtual_ipaddress {
        $(ip addr | grep 'inet ' | grep global | head -1 | sed 's/.*inet \([0-9]*\.[0-9]*\.[0-9]*\)\..*/\1.3/g') dev keepalived0
    }
}" >/etc/keepalived/keepalived.conf &&
echo "[NetDev]
Name=keepalived0
Kind=dummy" >/lib/systemd/network/90-keepalived.netdev &&
service systemd-networkd restart &&
service keepalived start

# There will be a new IP address x.y.z.3/32 added to keepalived0
ip addr

# Restart networkd. The IP address doesn't get destroyed like it did in the bug report
systemctl restart systemd-networkd
ip addr

# Restart keepalived. The IP address gets rebuilt the same as before
systemctl restart keepalived
ip addr
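
To check the result without eyeballing ip addr, something like the following could be used. Both helper names are made up for this sketch: derive_vip just wraps the sed expression from the commands above with a configurable last quad, and vip_present parses the one-line-per-address output of ip -o -4 addr show.

```shell
#!/bin/sh
# Hypothetical helpers; both read `ip` output on stdin so they are easy to try out.

# Print x.y.z.<suffix>, based on the first global IPv4 address in `ip addr` output.
derive_vip() {
    grep 'inet ' | grep global | head -n 1 |
        sed "s/.*inet \([0-9]*\.[0-9]*\.[0-9]*\)\..*/\1.$1/"
}

# Succeed if address $1 is assigned to interface $2 in `ip -o -4 addr show` output.
vip_present() {
    awk -v ip="$1" -v dev="$2" '
        $2 == dev && $3 == "inet" { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit !found }'
}

# Example:
#   vip="$(ip addr | derive_vip 3)"
#   ip -o -4 addr show | vip_present "$vip" keepalived0 && echo "VIP is up"
```

Running the check before and after systemctl restart systemd-networkd gives a quick pass/fail for the workaround.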

Ben Hollins (bhollins) wrote :

Thanks Karl. This solution from Chris Aumann seems perfect, and I've just deployed it onto our HAProxy pair. I just restarted udev and networkd, and everything survives as expected now. Much appreciated.

Robert Kirscht (robotic1) wrote :

Nice one keepalived crew for this excellent little app! Any news on a fix of this bug for the 1.x branch?

Chris Stone (cjstone707) wrote :

Thank you Karl, this one bit us too this morning. Will there be a fix soon?


Yup, us too :(

I've just applied my original fix (from when I filed this issue) to the systemd service again and made it persistent (for now) through our automation tooling.

> On 4 Sep 2019, at 19:57, Chris Stone <email address hidden> wrote:
>
> Thank you Karl, this one bit us too this morning. Will there be a fix
> soon?

The following 3 bugs:

https://bugs.launchpad.net/bugs/1815101
https://bugs.launchpad.net/bugs/1819074
https://bugs.launchpad.net/bugs/1810583

All have the same root cause: systemd-networkd interferes with secondary IP addresses on the NICs that it manages.

I'm marking all other cases as duplicates of LP: #1815101.

The TODO here is the following. There are mainly 2 "fixes" for this issue:

1) keepalived is able to recognize systemd-networkd changes and change the cluster status in order to reconfigure the managed NICs (keepalived > 2.0.x).

2) systemd-networkd implements a new setting (KeepConfiguration=) for its .network files, which fixes this behavior not only here but for all HA-related software that manages secondary IPs and/or aliases on NICs managed by systemd-networkd.

I think the most appropriate course is to make sure both features work together in Eoan, and then make sure the SRUs are done for Disco and Bionic. One problem with item (2) is that netplan will also have to support the new "KeepConfiguration=" setting, but fix (2) is the more appropriate one for all the other HA-related software controlling virtual IPs (CTDB, Pacemaker, and so on).
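
For option (2), a sketch of what the setting could look like on the systemd-networkd side. It assumes a systemd version that supports KeepConfiguration= and drop-ins for .network files, and it assumes netplan generated a unit named 10-netplan-ens3.network; the drop-in name and the helper function are made up. The value static is a guess at what suits addresses managed by keepalived, not something confirmed in this bug.

```shell
#!/bin/sh
# Hypothetical helper: write a .network drop-in into the directory given as $1.
# With netplan on ens3 that directory might be
# /etc/systemd/network/10-netplan-ens3.network.d (the unit name is an assumption).
write_keepconfig_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    # KeepConfiguration= belongs in the [Network] section of a .network file.
    cat > "$dropin_dir/keepconfig.conf" <<'EOF'
[Network]
KeepConfiguration=static
EOF
}

# Example (as root):
#   write_keepconfig_dropin /etc/systemd/network/10-netplan-ens3.network.d
#   systemctl restart systemd-networkd
```

Once netplan can express the setting itself, a hand-written drop-in like this should no longer be needed.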
