keepalived raising VIP apply error

Bug #1642763 reported by bugproxy on 2016-11-17
This bug affects 2 people
Affects               Importance   Assigned to
keepalived (Ubuntu)   Medium       Unassigned
Xenial                Undecided    Unassigned
Yakkety               Undecided    Unassigned

Bug Description

[Impact]

 * keepalived on ppc64el (due to a large page size) experiences "Netlink: error: message truncated" messages.

 * These Netlink truncations result in keepalived thinking that the underlying device does not exist, even though it does.
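
The truncation itself can be reproduced without netlink or root privileges. The following is a minimal sketch, assuming the reader detects truncation through the MSG_TRUNC flag that recvmsg() sets when a datagram is larger than the supplied buffer. The helper names recv_was_truncated and demo_truncation are hypothetical and are not keepalived functions; an AF_UNIX datagram socket pair stands in for the netlink socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Receive one datagram from fd into buf and report whether the kernel had
 * to cut it short. Returns 1 if truncated, 0 if it fit, -1 on error.
 * Hypothetical helper for illustration; not a keepalived function.
 */
static int recv_was_truncated(int fd, char *buf, size_t buflen)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buflen };
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	if (recvmsg(fd, &msg, 0) < 0)
		return -1;

	/* MSG_TRUNC is set when the datagram was larger than the buffer. */
	return (msg.msg_flags & MSG_TRUNC) ? 1 : 0;
}

/*
 * Send payload_len bytes over an AF_UNIX datagram socket pair (a stand-in
 * for the netlink socket) and read them back with a buffer of buflen bytes.
 */
static int demo_truncation(size_t payload_len, size_t buflen)
{
	static char payload[16384];
	static char buf[16384];
	int sv[2];
	int result;

	if (payload_len > sizeof(payload) || buflen > sizeof(buf))
		return -1;
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	if (send(sv[0], payload, payload_len, 0) < 0)
		result = -1;
	else
		result = recv_was_truncated(sv[1], buf, buflen);

	close(sv[0]);
	close(sv[1]);
	return result;
}
```

Calling demo_truncation(8000, 4096) models a message that does not fit in a 4096-byte buffer, while demo_truncation(8000, 8192) models the same message read with the larger buffer.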

[Test Case]

 * Create 100 veth interfaces on ppc64el and verify whether "Netlink: error: message truncated" errors are emitted. If so, the bug is present. If not, the bug is fixed.

[Regression Potential]

 * This is a code issue, fixed upstream, in the keepalived code when the system page size exceeds 4096. The upstream fix was backported to all releases and should only properly limit the size of the buffer used for netlink to at most 8192 on systems with a page size greater than 8192. I believe risk of regression is very low.

 * Using the tests provided by David Wilder, I ran on both x86_64 and ppc64el LXD containers. Without the backported changes, I saw no issues on x86_64, and the reported issue on ppc64el (as expected, as a page size greater than 4K is required to see the buffer size issue). With the backported changes, both architectures show no issue with the provided testcase.
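
The sizing rule described above (base the buffer on the page size, but never exceed 8192) can be sketched as follows. This is an illustration of the rule as stated in this report, not the upstream patch; the names NETLINK_BUF_MAX, netlink_buf_size and netlink_buf_size_for_host are hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Upper bound described in this report; the constant name is hypothetical. */
#define NETLINK_BUF_MAX 8192

/* Pick a receive-buffer size from a page size: the page size, capped at 8192. */
static size_t netlink_buf_size(long page_size)
{
	if (page_size <= 0)
		return NETLINK_BUF_MAX;	/* fall back to the cap if the page size is unknown */

	return page_size < NETLINK_BUF_MAX ? (size_t)page_size
					   : (size_t)NETLINK_BUF_MAX;
}

/* Convenience wrapper that asks the running system for its page size. */
static size_t netlink_buf_size_for_host(void)
{
	return netlink_buf_size(sysconf(_SC_PAGESIZE));
}
```

With a 4K page size this yields 4096, the size of the original fixed buffer; with the 64K page size reported on ppc64el it yields 8192.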

---

== Comment: #0 - Andrew Thorstensen - 2016-11-17 09:50:25 ==

---Problem Description---
Using Ubuntu 16.04 on ppc64le, we are building a 'neutron network node' using the VRRP configuration (built on keepalived).

Information on this OpenStack configuration can be found here: https://wiki.openstack.org/wiki/Neutron/L3_High_Availability_VRRP

When we run, the configuration is failing to apply via keepalived.

The logs post the following:
Nov 17 02:58:31 p8test-lp1 Keepalived_vrrp[54542]: VRRP is trying to assign VIP to unknown qr-a5f5ba96-52 interface !!! go out and fix your conf !!!

However, the device DOES exist. But the keepalived config just doesn't always deploy it.
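
One way to confirm independently of keepalived that the kernel knows the interface is to ask for its index. This is a minimal sketch; interface_exists is a hypothetical helper name, and the interface name passed in would be the one from the log line above.

```c
#include <net/if.h>

/* Returns 1 if the kernel knows an interface by this name, 0 otherwise. */
static int interface_exists(const char *ifname)
{
	/* if_nametoindex() returns 0 when no interface has that name. */
	return if_nametoindex(ifname) != 0;
}
```

For example, interface_exists("qr-a5f5ba96-52") returning 1 while keepalived reports the interface as unknown would point at the daemon's view of the interface list rather than at the configuration.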

ii keepalived 1:1.2.19-1 ppc64el Failover and monitoring daemon for LVS clusters

This configuration sometimes works but sometimes fails on Ubuntu 16.04.1.

---uname output---
Linux p8test-lp1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:38:24 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
This is a Power8 system with Ubuntu 16.04.1 installed. Though we see no indication that this is specific to Power.

Machine Type = S822L

Machine Type = 8286-42A

---Steps to Reproduce---
 Install openstack. Run the network node in a VRRP HA configuration. Create a router and assign a global IP.

== Comment: #5 - David J. Wilder - 2016-11-17 15:58:04 ==
The problem is fixed in this upstream commit:

https://github.com/acassen/keepalived/commit/9f327bbf3e86def1055a106eda0633638bda0345

On systems with a page size larger than 4096 keepalived may report:

"Netlink: error: message truncated" messages

This error was reported on ppc64le in an OpenStack/Neutron environment.
Ppc64le uses a 64k page size. I found that keepalived's netlink recvmsg
buffer was too small, causing messages to be truncated. The size of the read
buffer for the netlink socket should be based on page size; however, it should
not exceed 8192. See the comment in the patch.

I tested the fix by creating 100 veth interfaces and verifying the errors
did not return.

Signed-off-by: David Wilder <email address hidden>
Signed-off-by: Quentin Armitage <email address hidden>
...

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-148871 severity-critical targetmilestone-inin---

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → keepalived (Ubuntu)
Jon Grimm (jgrimm) wrote :

Thank you for your bug report and even more so for an upstreamed patch. I'm thinking this will need to be pushed into the cloud-archive as well. I'll coordinate if so.

Changed in keepalived (Ubuntu):
status: New → Triaged

So I expect this will end up as a Xenial SRU and, depending on your advice, go into cloud-archive as well, as Jon asked.

Given the SRU policy (https://wiki.ubuntu.com/StableReleaseUpdates) a fix should go into the Development release first (currently zesty) and become SRUs from there.

The fix upstream is just two days old and not released yet.

Currently Ubuntu has no delta over the Debian version which means we should report the issue to Debian, fix it there and pick it up into Zesty on a merge/sync of their latest version.
Your backported patch applies (almost) cleanly to the latest version in Debian/Ubuntu, so I'd ask you if you could report that to Debian as well and mention the created Debbug here.

We could then sync the package afterwards and avoid having to maintain a Delta of some sorts.
From there the path to SRUs would be free then.

Changed in keepalived (Ubuntu):
importance: Undecided → Medium
Jon Grimm (jgrimm) wrote :

Dave Wilder, do you have a way to trigger the bug and test the fix that wouldn't require a full openstack + HA-VRRP setup?

------- Comment From <email address hidden> 2016-11-28 13:07 EDT-------
(In reply to comment #13)
> Dave Wilder, do you have a way to trigger bug/test fix that wouldn't require
> a full openstack + HA-VRRP setup?

Here is my test script and the keepalived config I used (generated by openstack). Openstack is not needed to run my test script, just the config file.

vrrp_instance VR_1 {
    state BACKUP
    interface ha-99a3cb02-dc
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-99a3cb02-dc
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-99a3cb02-dc
    }
    virtual_ipaddress_excluded {
        27.0.0.1/24 dev qr-bc6c9831-52
        9.47.64.9/20 dev qg-9b2d21c4-59
        fe80::f816:3eff:fe95:9c41/64 dev qg-9b2d21c4-59 scope link
        fe80::f816:3eff:febb:d0ff/64 dev qr-bc6c9831-52 scope link
    }
    virtual_routes {
        0.0.0.0/0 via 9.47.79.254 dev qg-9b2d21c4-59
    }
}

The test script. Note: this script will not generate a working keepalived setup but it is sufficient to demonstrate the bug and verify the fix.

#!/bin/bash

# List of interfaces to create
Interfaces="ha-99a3cb02-dc qr-bc6c9831-52 qg-9b2d21c4-59"

# Un-comment to generate 100 extra veth pairs
# Interfaces="$(seq 1 100) ha-99a3cb02-dc qr-bc6c9831-52 qg-9b2d21c4-59"

# Counter used to give each veth peer a unique name (v-0, v-1, ...)
PEER=0

# Create each interface as one end of a veth pair and bring it up
for i in $Interfaces; do
    echo Creaating $i
    ip link add "$i" type veth peer name "v-$PEER"
    ifconfig "$i" up
    PEER=$((PEER+1))
done

# KEEPALIVED="/home/wilder/scratch/keepalived-1.2.24/keepalived/keepalived"
KEEPALIVED="keepalived"

CONF=$PWD/test.conf

# Run keepalived in the foreground, logging to the console
$KEEPALIVED -d -n -l -D -f "$CONF"

echo Done, cleaning up
for i in $Interfaces; do
    ip link del "$i"
done

-------
( test run showing the problem )
# ./setup
Creaating ha-99a3cb02-dc
Creaating qr-bc6c9831-52
Creaating qg-9b2d21c4-59
Starting Keepalived v1.2.24 (11/11,2016)
Opening file '/home/wilder/scratch/test.conf'.
Starting Healthcheck child process, pid=1130854
Starting VRRP child process, pid=1130855
Initializing ipvs
Netlink: error: message truncated <<<<< ****
Netlink: error: message truncated <<<<<<
Netlink: error: message truncated
Netlink: error: message truncated
Netlink: error: message truncated
Netlink: error: message truncated
Netlink reflector reports IP 192.168.0.2 added
Netlink reflector reports IP 192.168.0.2 added
Netlink reflector reports IP 172.17.0.1 added
Netlink reflector reports IP 172.17.0.1 added
Netlink reflector reports IP fe80::9abe:94ff:fe0d:f2f4 added
Netlink reflector reports IP fe80::9abe:94ff:fe0d:f2f4 added
Netlink reflector reports IP fe80::42:dbff:fe53:c725 added
Netlink reflector reports IP fe80::42:dbff:fe53:c725 added
Registering Kernel netlink reflector
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/home/wilder/scratch/test.conf'.
Opening file '/home/wilder/scratch/test.conf'.
Cant find interface ha-99a3cb02-dc for vrrp_instance VR_1 !!!
ha-99a3cb02-dc no match, ignoring...
VRRP is trying to assign ip address 169.254.0.1/24 to unknown ha-99a3cb02-dc interface !!! go out and fix your conf !!!
VRRP is trying to assign ip address 9.47.64.9/20 to unknown qg-9b2d21c4-59 interface !!! go out and fix your conf !!!
VRRP is...

Jon Grimm (jgrimm) wrote :

Thanks David. Asking Nish to take a look at this for you.

Changed in keepalived (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Nish Aravamudan (nacc)
Nish Aravamudan (nacc) on 2016-11-29
Changed in keepalived (Ubuntu):
status: Triaged → In Progress
Changed in keepalived (Ubuntu Xenial):
status: New → In Progress
Changed in keepalived (Ubuntu Yakkety):
status: New → In Progress
Changed in keepalived (Ubuntu Xenial):
assignee: nobody → Nish Aravamudan (nacc)
Changed in keepalived (Ubuntu Yakkety):
assignee: nobody → Nish Aravamudan (nacc)
Nish Aravamudan (nacc) on 2016-11-29
Changed in keepalived (Ubuntu):
status: In Progress → Fix Committed
Nish Aravamudan (nacc) on 2016-11-29
description: updated
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-29 16:21 EDT-------
(In reply to comment #16)
> Thanks David. Asking Nish to take a look at this for you.

Thanks for your attention to this issue. This has become an urgent issue for our customer. Could you please provide an ETA for when a fix will be available?

Nish Aravamudan (nacc) wrote :

On 29.11.2016 [21:30:21 -0000], bugproxy wrote:
> ------- Comment From <email address hidden> 2016-11-29 16:21 EDT-------
> (In reply to comment #16)
> > Thanks David. Asking Nish to take a look at this for you.
>
> Thanks for your attention to this issue. This has become an urgent
> issue for our customer. Could you please provide an ETA for when a fix
> will be available?

It will first need to get through the zesty queue (should only take a
few hours) and then the SRU team will need to consider it:
https://wiki.ubuntu.com/StableReleaseUpdates. Once they provide it in
the appropriate -proposed pockets, it can take a week to make it to
-updates, after verification.

Thanks,
Nish

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package keepalived - 1:1.2.24-1ubuntu1

---------------
keepalived (1:1.2.24-1ubuntu1) zesty; urgency=medium

  * debian/patches/fix_message_truncation_with_large_pagesizes.patch:
    Resolve "Netlink: error: message truncated" messages. Thanks to
    David Wilder <email address hidden>. Closes LP: #1642763.

 -- Nishanth Aravamudan <email address hidden> Tue, 29 Nov 2016 09:45:12 -0800

Changed in keepalived (Ubuntu):
status: Fix Committed → Fix Released

Hello bugproxy, or anyone else affected,

Accepted keepalived into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/keepalived/1:1.2.23-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in keepalived (Ubuntu Yakkety):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in keepalived (Ubuntu Xenial):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted keepalived into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/keepalived/1:1.2.19-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

------- Comment From <email address hidden> 2016-12-01 15:14 EDT-------
Nish-

Thank you for the quick turnaround. I verified the new package with my test code. The bug submitter will verify his set-up as well.

Version verified:
1:1.2.19-1ubuntu0.1

Kyle L. Henderson (kyleh) wrote :

Thank you for this fix.

I set up an OpenStack cluster using 2 controller nodes on Power systems and Xenial. The fix worked perfectly and resolved the issue I was seeing.

Version verified:
1:1.2.19-1ubuntu0.1

root@kyle-pwr-1:~# apt list --installed | grep keepalived

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

keepalived/xenial-proposed,now 1:1.2.19-1ubuntu0.1 ppc64el [installed]

bugproxy (bugproxy) on 2016-12-02
tags: added: targetmilestone-inin16041 verification-done
removed: targetmilestone-inin--- verification-needed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-07 12:54 EDT-------
Hi Nish

Do you have an ETA when the update will be available on the fix stream?

Sorry to bug you, pressure on my end.

Thanks for your support.

Nish Aravamudan (nacc) wrote :

Hi Dave,

I just checked and it seems like there was an automated test regression in neutron on amd64. Oddly, it passed on all other architectures, so I'm seeing if it's reproducible and if so, will debug.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-09 14:49 EDT-------
(In reply to comment #27)
> Hi Dave,
>
> I just checked and it seems like there was an automated test regression in
> neutron on amd64. Oddly, it passed on all other architectures, so I'm seeing
> if it's reproducible and if so, will debug.

Hi Nish

If the amd64 regression turns out to be related to the patch, you might want to try removing the patch and simply changing the buffer size to 8192. This accomplishes the same result but is far simpler.

/* Our netlink parser */
static int
netlink_parse_info(int (*filter) (struct sockaddr_nl *, struct nlmsghdr *),
@@ -254,7 +273,7 @@ netlink_parse_info(int (*filter) (struct sockaddr_nl *, struct nlmsghdr *),
     int error;
     while (1) {
-        char buf[4096];
+        char buf[8192];

Thanks for your help

Dave.

Hi Dave,
Sorry for the delay, I was at a sprint. The regression was a false negative
due to other issues. I reran the test and it passed, so given the already
v-d status, I think it should transition within 7 days normally.

------- Comment From <email address hidden> 2016-12-09 18:03 EDT-------
(In reply to comment #30)
> Hi Dave,
> Sorry for the delay, I was at a sprint. The regression was a false negative
> due to other issues. I reran the test and it passed, so given the already
> v-d status, I think it should transition within 7 days normally.

Good news! Thanks for your help.

Brian Murray (brian-murray) wrote :

I only see comments about this being verified in Xenial, not Yakkety so I've fixed the tags appropriately.

tags: added: verification-done-xenial verification-needed
removed: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package keepalived - 1:1.2.19-1ubuntu0.1

---------------
keepalived (1:1.2.19-1ubuntu0.1) xenial; urgency=medium

  * debian/patches/fix_message_truncation_with_large_pagesizes.patch:
    Resolve "Netlink: error: message truncated" messages. Thanks to
    David Wilder <email address hidden>. Closes LP: #1642763.

 -- Nishanth Aravamudan <email address hidden> Tue, 29 Nov 2016 10:31:22 -0800

Changed in keepalived (Ubuntu Xenial):
status: Fix Committed → Fix Released
Nish Aravamudan (nacc) wrote :

Dave, would you be able to also test in 16.10? I'll do my best to set up an environment on my end, but it would be good to have you verify it as well.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-14 12:22 EDT-------
(In reply to comment #34)
> Dave, would you be able to also test in 16.10? I'll do my best to set up an
> environment on my end, but it would be good to have you verify it as well.

No problem, I have a 16.10 VM I can test with.

I was able to update and verify the fix on 16.04 (ppc64le), but when I installed keepalived on 16.10, I don't see that the fix has been applied yet. Should I be pulling from a different repo? Or am I missing something?

root@ubu-1610:~# cat /etc/issue
Ubuntu 16.10 \n \l

root@ubu-1610:~# apt -a list keepalived
Listing... Done
keepalived/yakkety,now 1:1.2.23-1 ppc64el [installed]

root@ubu-1610:~# keepalived --version
Keepalived v1.2.23 (07/21,2016) <<<< prior to fix date.

Nish Aravamudan (nacc) wrote :

Hi Dave,

Did you enable yakkety-proposed in your VM? Per rmadison:

keepalived | 1:1.2.23-1ubuntu0.1

is available in yakkety-proposed for all architectures.

Thanks,
Nish

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-14 19:22 EDT-------
(In reply to comment #36)

> Did you enable yakkety-proposed in your VM? Per rmadison:

Doh. Thanks for the hint.

I was able to successfully verify keepalived from yakkety-proposed. My test case ran fine, no errors.

# apt -a list keepalived
Listing... Done
keepalived/yakkety-proposed,now 1:1.2.23-1ubuntu0.1 ppc64el [installed]
keepalived/yakkety 1:1.2.23-1 ppc64el

# keepalived --version
Keepalived v1.2.23 (12/01,2016)

Nish Aravamudan (nacc) on 2016-12-15
tags: added: verification-done-yakkety
removed: verification-needed

The verification of the Stable Release Update for keepalived has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package keepalived - 1:1.2.23-1ubuntu0.1

---------------
keepalived (1:1.2.23-1ubuntu0.1) yakkety; urgency=medium

  * debian/patches/fix_message_truncation_with_large_pagesizes.patch:
    Resolve "Netlink: error: message truncated" messages. Thanks to
    David Wilder <email address hidden>. Closes LP: #1642763.

 -- Nishanth Aravamudan <email address hidden> Tue, 29 Nov 2016 10:00:52 -0800

Changed in keepalived (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Nish Aravamudan (nacc) on 2016-12-15
Changed in keepalived (Ubuntu):
assignee: Nish Aravamudan (nacc) → nobody
Changed in keepalived (Ubuntu Xenial):
assignee: Nish Aravamudan (nacc) → nobody
Changed in keepalived (Ubuntu Yakkety):
assignee: Nish Aravamudan (nacc) → nobody