L3 HA: Unstable rescheduling time for keepalived v1.2.7

Bug #1497272 reported by Ann Taraday
This bug affects 1 person
Affects              Status         Importance   Assigned to     Milestone
OpenStack-Ansible    Fix Released   High         Major Hayden
neutron              Won't Fix      Undecided    Unassigned
openstack-manuals    Fix Released   Medium       John Davidge

Bug Description

I tested L3 HA on a Kilo environment with 3 controllers and 1 compute node, using this simple scenario:
1) ping a VM by its floating IP
2) disable the master l3-agent (the one whose ha_state is active)
3) wait for the pings to resume once another agent becomes active
4) check the number of packets that were lost

My results are as follows:
1) When max_l3_agents_per_router=2, 3 to 4 packets were lost.
2) When max_l3_agents_per_router=3 or 0 (meaning the router will be scheduled on every agent), 10 to 70 packets were lost.

I should mention that in both cases there was only one HA router.

I expected fewer packets to be lost when max_l3_agents_per_router=3 (or 0).
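
For step 4 of the scenario, the loss can be counted from the ping output itself rather than by eye. The following is only a minimal sketch and not what the reporter ran: it assumes Linux iputils-style reply lines ("64 bytes from ...: icmp_seq=N ..."), ignores sequence-number wraparound, and the function name count_lost_packets is hypothetical.

```python
import re

# Assumption: replies look like iputils ping output, e.g.
# "64 bytes from 10.0.0.5: icmp_seq=12 ttl=63 time=0.61 ms"
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def count_lost_packets(ping_output: str) -> int:
    """Count sequence numbers missing between the first and last reply."""
    seen = set()
    for line in ping_output.splitlines():
        # Only successful replies count; error lines such as
        # "Destination Host Unreachable" also carry icmp_seq.
        if "bytes from" not in line:
            continue
        match = SEQ_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return 0
    expected = max(seen) - min(seen) + 1
    return expected - len(seen)


sample = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=63 time=0.52 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=63 time=0.48 ms
From 10.0.0.1 icmp_seq=3 Destination Host Unreachable
64 bytes from 10.0.0.5: icmp_seq=6 ttl=63 time=0.55 ms
64 bytes from 10.0.0.5: icmp_seq=7 ttl=63 time=0.50 ms
"""
print(count_lost_packets(sample))
```

With the default ping interval of one second, the count also gives a rough failover time in seconds.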

Assaf Muller (amuller) wrote :

What is your installed keepalived version? Check out https://bugs.launchpad.net/neutron/+bug/1433172.

Assaf Muller (amuller) wrote :

More context: the bug I linked (a bug in certain keepalived versions) only manifests when more than two L3 agents participate in the HA router domain, which would explain the behavior you're seeing. If you're using a 'bad' version of keepalived, /var/log/syslog or /var/log/messages will show keepalived flapping between backup and master.
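
One way to spot that flapping is to count state changes per VRRP instance in the log. This is a rough sketch only: the "Entering MASTER STATE" / "Entering BACKUP STATE" wording is an assumption about keepalived's log messages and should be checked against the actual syslog on the node, and the function name count_state_changes is hypothetical.

```python
import re
from collections import Counter

# Assumption: keepalived logs state changes in this form, e.g.
# "Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering BACKUP STATE"
STATE_RE = re.compile(
    r"VRRP_Instance\((?P<inst>[^)]+)\) Entering (?P<state>MASTER|BACKUP) STATE"
)


def count_state_changes(lines):
    """Return a Counter of MASTER<->BACKUP changes per VRRP instance."""
    changes = Counter()
    last_state = {}
    for line in lines:
        match = STATE_RE.search(line)
        if not match:
            continue
        inst, state = match["inst"], match["state"]
        previous = last_state.get(inst)
        # The first sighting of an instance is not a change.
        if previous is not None and previous != state:
            changes[inst] += 1
        last_state[inst] = state
    return changes


log_lines = [
    "Sep 18 10:00:01 node-1 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Sep 18 10:00:07 node-1 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering MASTER STATE",
    "Sep 18 10:00:09 node-1 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering BACKUP STATE",
    "Sep 18 10:00:15 node-1 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering MASTER STATE",
    "Sep 18 10:00:16 node-1 sshd[99]: unrelated message",
]
print(count_state_changes(log_lines))
```

A single change after the master is disabled is the expected failover; many changes within a short window on the same instance suggest the flapping described above.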

Ann Taraday (akamyshnikova) wrote :

I'm using Keepalived v1.2.7 (08/14,2013). I checked /var/log/syslog and didn't find any strange messages from keepalived.

Assaf Muller (amuller) wrote :

If you monitor /var/log/syslog for the node you're killing, and for the other two nodes, what messages are you seeing from keepalived?

Assaf Muller (amuller) wrote :

And what do the timestamps look like? Does it take minutes for keepalived itself to stabilize? If so, I'd try with something like keepalived 1.2.13.

Ann Taraday (akamyshnikova) wrote :

I've updated keepalived to 1.2.19 and now L3 HA works very well! Thanks a lot!
The only concern here is that keepalived has a lot of dependencies, so updating it can be inconvenient. Do we have a list of reliable versions of keepalived? Maybe I've missed it; if not, it would be good to have one.

summary: - L3 HA: Unstable rescheduling time
+ L3 HA: Unstable rescheduling time for keepalived v1.2.7
Assaf Muller (amuller) wrote :

As you can imagine, I've been developing and testing L3 HA on Fedora, CentOS and RHEL. When I started working on it, v1.2.13 was the latest, and it works well. v1.2.14 is broken, and so are early versions of v1.2.15. v1.2.16+ should be good. IPv6 support works from v1.2.10+. That's all I've got :) This should probably be documented properly in the network guide...
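
The version notes in this comment, together with the reporter's observations for v1.2.7 and v1.2.19, can be turned into a quick check. This is an illustrative sketch, not an authoritative compatibility list: the function names and verdict strings are hypothetical, and any version not mentioned in this thread is reported as unknown rather than guessed.

```python
import re


def parse_version(text):
    """Extract the first x.y.z version from text such as 'Keepalived v1.2.7 (08/14,2013)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def l3_ha_verdict(version_text):
    """Classify a keepalived version using only what this bug thread reports."""
    version = parse_version(version_text)
    if version >= (1, 2, 16):
        return "reported good"    # "v1.2.16+ should be good"; v1.2.19 confirmed by the reporter
    if version == (1, 2, 15):
        return "suspect"          # only early v1.2.15 builds are reported broken
    if version == (1, 2, 14):
        return "reported broken"
    if version == (1, 2, 13):
        return "reported good"
    if version == (1, 2, 7):
        return "reported broken"  # the version this bug was filed against
    return "unknown"              # not covered by the thread


for candidate in ("Keepalived v1.2.7 (08/14,2013)", "1.2.13", "1.2.14", "1.2.19"):
    print(candidate, "->", l3_ha_verdict(candidate))
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.2.7" would sort after "1.2.16".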

Changed in neutron:
status: New → Triaged
Ann Taraday (akamyshnikova) wrote :

Thanks a lot!

It would be good to add this to the network guide, as Ubuntu (12.04, 14.04, 14.10, 15.04) uses keepalived v1.2.7, and only Ubuntu 15.10 is going to use keepalived v1.2.19.

Akihiro Motoki (amotoki) wrote :

Per bug discussion, it is better to add a note on appropriate keepalived versions to the networking guide.

tags: added: networking-guide
Changed in openstack-manuals:
status: New → Triaged
importance: Undecided → Medium
Chason Chan (chen-xing)
Changed in openstack-manuals:
assignee: nobody → Chason (chen-xing)
Changed in openstack-ansible:
assignee: nobody → Jean-Philippe Evrard (jean-philippe-evrard)
status: New → Confirmed
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-os_neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/338129

Changed in openstack-ansible:
status: Confirmed → In Progress
Changed in openstack-ansible:
assignee: Jean-Philippe Evrard (jean-philippe-evrard) → Jesse Pretorius (jesse-pretorius)
Changed in openstack-ansible:
assignee: Jesse Pretorius (jesse-pretorius) → Jean-Philippe Evrard (jean-philippe-evrard)
Changed in openstack-ansible:
assignee: Jean-Philippe Evrard (jean-philippe-evrard) → Major Hayden (rackerhacker)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-os_neutron (master)

Reviewed: https://review.openstack.org/338129
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-os_neutron/commit/?id=0b8721141f9b526ba4902f5cfc53f05c2fc0758e
Submitter: Jenkins
Branch: master

commit 0b8721141f9b526ba4902f5cfc53f05c2fc0758e
Author: Jean-Philippe Evrard <email address hidden>
Date: Wed Jul 6 10:10:52 2016 +0100

    Use UCA for non-OVS neutron

    This commit refactors tasks to allow the use of UCA for Linux Bridge.
    It also changes default behavior: now every neutron install will
    make use of Ubuntu Cloud Archive, unless mentionned.

    Closes-Bug: 1497272
    Closes-Bug: 1433172

    Change-Id: I4373f544eb178720f33795a71adae925a8b8cb03
    Signed-off-by: Jean-Philippe Evrard <email address hidden>

Changed in openstack-ansible:
status: In Progress → Fix Released
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible-os_neutron 14.0.0.0b2

This issue was fixed in the openstack/openstack-ansible-os_neutron 14.0.0.0b2 development milestone.

Chason Chan (chen-xing)
Changed in openstack-manuals:
assignee: Chason (chen-xing) → nobody
Changed in openstack-manuals:
milestone: none → ocata
assignee: nobody → John Davidge (john-davidge)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-manuals (master)

Fix proposed to branch: master
Review: https://review.openstack.org/430206

Changed in openstack-manuals:
status: Triaged → In Progress
Changed in openstack-manuals:
assignee: John Davidge (john-davidge) → Alexandra Settle (alexandra-settle)
Changed in openstack-manuals:
assignee: Alexandra Settle (alexandra-settle) → John Davidge (john-davidge)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-manuals (master)

Reviewed: https://review.openstack.org/430206
Committed: https://git.openstack.org/cgit/openstack/openstack-manuals/commit/?id=430909f6e15e56308c3895007e20a91f73b9412a
Submitter: Jenkins
Branch: master

commit 430909f6e15e56308c3895007e20a91f73b9412a
Author: John Davidge <email address hidden>
Date: Tue Feb 7 11:20:31 2017 +0000

    [networking] Add a note on bug in keepalived

    Describes how a bug in keepalived v1.2.15 and earlier can affect
    operation of neutron features, and recommends upgrading to a greater
    version to avoid problems.

    Change-Id: I05de49e0043347b2cfcce3af8cf68796f70334b9
    Closes-Bug: #1497272

Changed in openstack-manuals:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-manuals 15.0.0

This issue was fixed in the openstack/openstack-manuals 15.0.0 release.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

This is a keepalived issue, and supported platforms such as CentOS/RHEL and Xenial already ship fixed packages. We have also documented the issue in the networking guide. There seems to be nothing more we can do on the neutron side, so I'm moving the bug to Won't Fix.

Changed in neutron:
status: Triaged → Won't Fix