ha_vrrp_health_check_interval causes constant VRRP transitions

Bug #1793102 reported by Hua Zhang
This bug affects 1 person
Affects    Status    Importance    Assigned to    Milestone
neutron    Invalid   Undecided     Unassigned

Bug Description

Commit 185d6cbc648fd041402a5034b04b818da5c7136e added support for the keepalived VRRP health check, but actually enabling the ha_vrrp_health_check_interval option causes constant VRRP transitions.

This seems to happen because keepalived fails to run ha_check_script_1.sh successfully, even though the same script runs fine by hand.

Sep 18 08:19:41 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh exited with status 1
Sep 18 08:19:41 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Script(ha_health_check_1) failed
Sep 18 08:19:43 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Instance(VR_1) Entering FAULT STATE
Sep 18 08:19:43 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Instance(VR_1) removing protocol Virtual Routes
Sep 18 08:19:43 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Instance(VR_1) removing protocol VIPs.
Sep 18 08:19:43 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Instance(VR_1) removing protocol E-VIPs.
Sep 18 08:19:43 juju-23f84c-queens-dvr-5 Keepalived_vrrp[8448]: VRRP_Instance(VR_1) Now in FAULT state
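The log shows the chain keepalived follows: the check script exits with status 1, the vrrp_script is marked failed, and the VRRP instance that tracks it enters FAULT state. keepalived counts consecutive results: a vrrp_script becomes failed after "fall" consecutive non-zero exits and healthy again after "rise" consecutive zero exits, so a script that keeps alternating between runs of failures and runs of successes moves the router in and out of FAULT each time. The hypothetical simulate_fall_rise function below replays a list of exit statuses through that counting. It is a simplified model, not keepalived's code: it assumes the script starts healthy and ignores timeouts and track_script weights.

```shell
#!/bin/bash
# Hypothetical sketch: replay check-script exit statuses through
# keepalived-style fall/rise counting and print the state after each run.
# Assumption: the script starts in the OK state (no init_fail).
simulate_fall_rise() {
    local fall="$1" rise="$2"
    shift 2
    local state="OK" fails=0 oks=0 rc
    for rc in "$@"; do
        if [ "$rc" -eq 0 ]; then
            oks=$((oks + 1))
            fails=0
            # Leave FAULT only after `rise` consecutive successes.
            if [ "$state" = "FAULT" ] && [ "$oks" -ge "$rise" ]; then
                state="OK"
            fi
        else
            fails=$((fails + 1))
            oks=0
            # Enter FAULT after `fall` consecutive failures.
            if [ "$state" = "OK" ] && [ "$fails" -ge "$fall" ]; then
                state="FAULT"
            fi
        fi
        echo "$state"
    done
}
```

For example, with the fall 2 / rise 2 values from the keepalived.conf quoted later in this report, simulate_fall_rise 2 2 0 1 1 0 0 prints OK, OK, FAULT, FAULT, OK (one per line).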

root@juju-23f84c-queens-dvr-5:~# ll /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh
-r-x-w---- 1 neutron neutron 109 Sep 18 03:45 /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh*
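Because the script succeeds by hand but fails under keepalived, one difference worth ruling out is the environment: an interactive root shell has a full PATH and other variables that a daemon may not pass on. The hypothetical helper below runs a script with env -i and a minimal PATH, which only approximates how keepalived invokes it (an assumption, not documented keepalived behaviour), and reports the exit status, which is the only result keepalived acts on. It does not reproduce differences in user or network namespace; neutron normally starts keepalived inside the router's network namespace, so a closer manual test would also wrap the call in ip netns exec with that namespace, where ip a lists the router's interfaces rather than the host's.

```shell
#!/bin/bash
# Hypothetical diagnostic helper, not part of neutron or keepalived.
# Runs a check script with an empty environment plus a minimal PATH
# (assumed to resemble a daemon's environment) and prints its exit status.
run_check_clean() {
    local script="$1"
    local rc=0
    # env -i drops every inherited variable; only PATH is set again.
    env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin "$script" </dev/null || rc=$?
    echo "exit status: $rc"
    return "$rc"
}
```

Usage: run_check_clean /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh, then compare the result with a run from the interactive shell.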

Tags: l3-ha sts
Boden R (boden)
tags: added: l3-ha
Hua Zhang (zhhuabj)
tags: added: sts
Miguel Lavalle (minsel) wrote :

Hi,

Thanks for this report. I have some questions:

1) Where do you see the constant transitions? Do you see them in the L3 agent log? I just configured my DVR/HA development system with ha_vrrp_health_check_interval = 5 and don't see any transitions. I am running code very close to master.

2) What value are you using for ha_vrrp_health_check_interval? I am using the recommended value: https://github.com/openstack/neutron/blob/master/neutron/conf/agent/l3/ha.py#L50

Miguel Lavalle (minsel) wrote :

I'll let my system run for a while and report back whether I see transitions. Cheers

Hua Zhang (zhhuabj) wrote :

Hi Miguel,

The VRRP transitions in my description come from /var/log/syslog, not the L3 agent log, and I am using ha_vrrp_health_check_interval=30, the value recommended in the documentation [1]. The problem is very easy to reproduce: as soon as we enable this option, the transitions appear repeatedly, and as soon as we disable it, they disappear. Looking forward to your test results, thank you very much.

[1] https://docs.openstack.org/ocata/networking-guide/config-dvr-ha-snat.html

Hua Zhang (zhhuabj) wrote :

I have tried the following four approaches, but none of them solved the problem.

1. chmod +x /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh

root@juju-23f84c-queens-dvr-5:~# ll /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh
-r-x-w---- 1 neutron neutron 109 Sep 18 03:45 /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh*
root@juju-23f84c-queens-dvr-5:~# chmod +x /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh
root@juju-23f84c-queens-dvr-5:~# ll /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh
-r-x-wx--x 1 neutron neutron 109 Sep 18 03:45 /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh*

2. Ran the script by hand, both as root and as the neutron user, with || echo 'error' appended (neither run printed 'error', so both exited with status 0):

root@juju-23f84c-queens-dvr-5:~# cat /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh
#!/bin/bash -eu
ip a | grep fe80::f816:3eff:fe78:bd5c || exit 0
ping -c 1 -w 1 10.5.0.1 1>/dev/null || exit 1

root@juju-23f84c-queens-dvr-5:~# /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh || echo 'error'
root@juju-23f84c-queens-dvr-5:~# sudo -u neutron /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh || echo 'error'

3. Added the line 'user neutron' to the vrrp_script section of /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/keepalived.conf:

vrrp_script ha_health_check_1 {
    script "/var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh"
    interval 10
    fall 2
    rise 2
    user neutron
}

4. Added the line 'enable_script_security' to the global_defs section of /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/keepalived.conf. This stops the VRRP transitions, but it seems that the VRRP script stopped running as well.

global_defs {
    notification_email_from <email address hidden>
    router_id neutron
    enable_script_security
}
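Attempt 4 is consistent with keepalived's script security rules as described in keepalived.conf(5): with enable_script_security set, keepalived will not run a script as root if any part of its path is writable by a non-root user. The listings above show ha_check_script_1.sh owned by neutron:neutron with the group write bit set, and /var/lib/neutron is likely neutron-owned as well, so if the script is run as root it would be refused rather than executed, which could explain why the script appeared to stop. The hypothetical helper below lists the path components that would trip a check of this kind. It is a simplified model, not keepalived's own logic: it ignores ACLs, the sticky bit and the configured script user.

```shell
#!/bin/bash
# Hypothetical helper approximating an enable_script_security-style
# path check (simplified model, not keepalived's code): print every
# component of an absolute path that a non-root user could modify.
# Requires GNU stat.
non_root_writable_components() {
    local p="$1" mode uid bits
    case "$p" in
        /*) ;;
        *) echo "need an absolute path: $p" >&2; return 2 ;;
    esac
    while :; do
        mode="$(stat -c '%a' "$p")"   # octal permission bits, e.g. 520
        uid="$(stat -c '%u' "$p")"    # numeric owner, 0 is root
        bits=$((8#$mode))
        # Unsafe if group (020) or other (002) may write, or if a
        # non-root owner may write (200).
        if (( bits & 8#022 )) || { [ "$uid" -ne 0 ] && (( bits & 8#200 )); }; then
            echo "$p"
        fi
        if [ "$p" = "/" ]; then
            break
        fi
        p="$(dirname "$p")"
    done
}
```

Usage: non_root_writable_components /var/lib/neutron/ha_confs/909c6b55-9bc6-476f-9d28-c32d031c41d7/ha_check_script_1.sh prints each offending component, starting with the script itself.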

Hua Zhang (zhhuabj) wrote :

I can't reproduce this problem today; luckily, everything is running well. Unfortunately, I lost my previous test environment, so today I am using a new one and can't compare the two. I believe the difference is that my new test environment includes the following keepalived patch. Thanks, all.

https://github.com/acassen/keepalived/commit/e90a633c34fbe6ebbb891aa98bf29ce579b8b45c

Changed in neutron:
status: New → Invalid