Multiple SIGHUPs to keepalived might trigger re-election

Bug #1647432 reported by John Schwarz on 2016-12-05
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: Ihar Hrachyshka

Bug Description

As the title says, multiple SIGHUPs sent to the keepalived process in quick succession might cause it to forfeit mastership and negotiate a new master (which might be the original master). This means that when, for example, associating or disassociating 2 floating IPs in quick succession (each triggers a SIGHUP), the master node might forfeit its mastership, causing it to switch to BACKUP, thus removing all the remaining floating IP addresses and severing connectivity.

Fix proposed to branch: master
Review: https://review.openstack.org/407099

Changed in neutron:
status: New → In Progress
Changed in neutron:
milestone: none → ocata-rc1
tags: added: newton-backport-potential
Changed in neutron:
milestone: ocata-rc1 → pike-1
Changed in neutron:
assignee: John Schwarz (jschwarz) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Cedric Brandily (cbrandily)
Changed in neutron:
assignee: Cedric Brandily (cbrandily) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Ihar Hrachyshka (ihar-hrachyshka)
Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Ihar Hrachyshka (ihar-hrachyshka)
Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → Jakub Libosvar (libosvar)
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Ihar Hrachyshka (ihar-hrachyshka)

Reviewed: https://review.openstack.org/407099
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=977d254cc69915819cf4226dc8cfc8c36969735b
Submitter: Jenkins
Branch: master

commit 977d254cc69915819cf4226dc8cfc8c36969735b
Author: John Schwarz <email address hidden>
Date: Mon Dec 5 14:15:17 2016 +0200

    Throttle SIGHUPs to keepalived

    Multiple SIGHUPs in quick succession might cause the master keepalived
    to forfeit its mastership (which will cause keepalived to remove IPs of
    its external devices, severing connectivity). This can happen when, for
    example, associating or disassociating multiple floatingips.

    The patch makes the agent throttle SIGHUP sent to keepalived: the very first
    SIGHUP is always sent; as for subsequent signals, they are delayed till
    agent threshold is reached. (It's 3 seconds by default.)

    As an example, when three consequent router updates trigger keepalived
    respawn then:
    * the very first signal is sent as usual;
    * the second signal is deferred and sent in up to 3 seconds since the
      first signal;
    * the third signal is ignored, though the change that triggered it will
      be correctly applied by the second signal handler when it is triggered
      after threshold delay.

    If the last time a spawn request occurred is older than current-time
    minus threshold then there is no delay.

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Co-Authored-By: Cedric Brandily <email address hidden>
    Co-Authored-By: Ihar Hrachyshka <email address hidden>

    Closes-Bug: 1647432
    Change-Id: I2955e0de835458a2eea4dd088addf33b656f8670
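
The throttling behaviour described in the commit message above can be sketched roughly as follows. This is only an illustration of the idea, not the code merged in the review: the class name `SighupThrottler`, its `request()` method, and the injectable `clock` and `sleep` parameters are all hypothetical, and the sketch uses plain threading primitives rather than whatever the agent actually uses.

```python
import threading
import time


class SighupThrottler:
    """Hypothetical sketch: rate-limit a 'send signal' callback.

    The first request is sent immediately. A request arriving within
    `threshold` seconds of the last send is deferred until the threshold
    has elapsed. Requests arriving while one is already deferred are
    dropped, because the deferred send will pick up their changes too.
    """

    def __init__(self, send, threshold=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._send = send            # callable that delivers the signal
        self._threshold = threshold  # minimum seconds between two sends
        self._clock = clock          # injectable for testing (assumption)
        self._sleep = sleep          # injectable for testing (assumption)
        self._lock = threading.Lock()
        self._last_sent = None       # time of the previous send, if any
        self._pending = False        # True while a send is waiting

    def request(self):
        """Ask for a signal; return True if this call sent it."""
        with self._lock:
            if self._pending:
                # Another request is already waiting; it covers this one.
                return False
            self._pending = True
            if self._last_sent is None:
                delay = 0.0
            else:
                elapsed = self._clock() - self._last_sent
                delay = max(0.0, self._threshold - elapsed)
        if delay > 0:
            self._sleep(delay)
        with self._lock:
            self._last_sent = self._clock()
            # Clear the flag before sending so that a change made while
            # the signal is being delivered triggers a fresh request.
            self._pending = False
        self._send()
        return True
```

With a real process, `send` could be something like `lambda: os.kill(pid, signal.SIGHUP)`, where `pid` is the keepalived process ID obtained elsewhere. Making the clock and sleep function injectable is a choice made here only so the timing logic can be exercised without real delays.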

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/451140
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=23c7c8a08eee53c85b8cdd4549f024f6865e784c
Submitter: Jenkins
Branch: stable/ocata

commit 23c7c8a08eee53c85b8cdd4549f024f6865e784c
Author: John Schwarz <email address hidden>
Date: Mon Dec 5 14:15:17 2016 +0200

    Throttle SIGHUPs to keepalived

    Multiple SIGHUPs in quick succession might cause the master keepalived
    to forfeit its mastership (which will cause keepalived to remove IPs of
    its external devices, severing connectivity). This can happen when, for
    example, associating or disassociating multiple floatingips.

    The patch makes the agent throttle SIGHUP sent to keepalived: the very first
    SIGHUP is always sent; as for subsequent signals, they are delayed till
    agent threshold is reached. (It's 3 seconds by default.)

    As an example, when three consequent router updates trigger keepalived
    respawn then:
    * the very first signal is sent as usual;
    * the second signal is deferred and sent in up to 3 seconds since the
      first signal;
    * the third signal is ignored, though the change that triggered it will
      be correctly applied by the second signal handler when it is triggered
      after threshold delay.

    If the last time a spawn request occurred is older than current-time
    minus threshold then there is no delay.

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Co-Authored-By: Cedric Brandily <email address hidden>
    Co-Authored-By: Ihar Hrachyshka <email address hidden>

    Closes-Bug: 1647432
    Change-Id: I2955e0de835458a2eea4dd088addf33b656f8670
    (cherry picked from commit 977d254cc69915819cf4226dc8cfc8c36969735b)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/451150
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=23ded198e6f9c433b5edc350fb23abd33ccbd127
Submitter: Jenkins
Branch: stable/newton

commit 23ded198e6f9c433b5edc350fb23abd33ccbd127
Author: John Schwarz <email address hidden>
Date: Mon Dec 5 14:15:17 2016 +0200

    Throttle SIGHUPs to keepalived

    Multiple SIGHUPs in quick succession might cause the master keepalived
    to forfeit its mastership (which will cause keepalived to remove IPs of
    its external devices, severing connectivity). This can happen when, for
    example, associating or disassociating multiple floatingips.

    The patch makes the agent throttle SIGHUP sent to keepalived: the very first
    SIGHUP is always sent; as for subsequent signals, they are delayed till
    agent threshold is reached. (It's 3 seconds by default.)

    As an example, when three consequent router updates trigger keepalived
    respawn then:
    * the very first signal is sent as usual;
    * the second signal is deferred and sent in up to 3 seconds since the
      first signal;
    * the third signal is ignored, though the change that triggered it will
      be correctly applied by the second signal handler when it is triggered
      after threshold delay.

    If the last time a spawn request occurred is older than current-time
    minus threshold then there is no delay.

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Co-Authored-By: Cedric Brandily <email address hidden>
    Co-Authored-By: Ihar Hrachyshka <email address hidden>

    Conflicts:
     neutron/agent/linux/keepalived.py
     neutron/common/utils.py
     neutron/tests/fullstack/test_l3_agent.py

    Closes-Bug: 1647432
    Change-Id: I2955e0de835458a2eea4dd088addf33b656f8670
    (cherry picked from commit 977d254cc69915819cf4226dc8cfc8c36969735b)

tags: added: in-stable-newton

This issue was fixed in the openstack/neutron 10.0.1 release.

This issue was fixed in the openstack/neutron 9.3.1 release.

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

tags: removed: newton-backport-potential