Bug #1602320 “ha + distributed router: keepalived process kill ...” : Bugs : neutron

Assaf Muller (amuller) wrote on 2016-07-12:

#1

The L3 agent sends SIGHUP to keepalived to reconfigure it every time the router is changed through the API: adding or removing a router interface, or adding or removing a floating IP. Are you sure that's not what's happening here?

Also, does this reproduce with a router that is HA but not distributed?

Brandon Logan (brandon-logan) wrote on 2016-07-12:

#2

So you're saying this has only occurred when your nodes are containers and not VMs?

Dongcan Ye (hellochosen) wrote on 2016-07-12:

#3

@Assaf, I tested removing a router interface after the add operation and traced the log: the SIGHUP was received by the vrrp process, but it was not killed.
Testing with a router that is only HA seems normal; the problem can't be reproduced there.

@Brandon Logan, yes, this occurs in containers.

Dongcan Ye (hellochosen) wrote on 2016-07-14:

#4

@Assaf, could you give me some ideas for troubleshooting this problem? Thanks.

Dongcan Ye (hellochosen) on 2016-07-15

Changed in neutron:
assignee:	nobody → Dongcan Ye (hellochosen)

Dongcan Ye (hellochosen) wrote on 2016-07-15:

#5

@Brandon Logan, this also occurs in VMs.

@Assaf and Brandon, I have a workaround for this.
We can add the "-R" option to the keepalived command line; this option prevents the vrrp child processes from respawning.
When this option is specified, if either the checker or vrrp child process exits, the parent process will raise the SIGTERM signal and exit. [1]

In some situations, when the vrrp subprocess is killed by a SIGHUP signal, we can then stop and start Keepalived.
This avoids a wrong HA state for the HA router.

[1] http://www.keepalived.org/changelog.html
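
For illustration only, here is a minimal sketch of how a keepalived command line with the "-R" option might be assembled. The function name and the file paths are hypothetical and are not taken from the proposed patch; the sketch assumes the usual keepalived options -P (VRRP only), -f (config file), -p (parent pid file) and -r (vrrp pid file).

```python
from typing import List


def build_keepalived_cmd(config_path: str,
                         pid_path: str,
                         vrrp_pid_path: str,
                         dont_respawn: bool = True) -> List[str]:
    """Return an argv list for keepalived (hypothetical helper).

    With dont_respawn=True the "-R" option is appended, so the parent
    exits instead of respawning a vrrp child that has died.
    """
    cmd = [
        "keepalived",
        "-P",                 # run only the VRRP subsystem
        "-f", config_path,    # configuration file
        "-p", pid_path,       # pid file of the parent process
        "-r", vrrp_pid_path,  # pid file of the vrrp child process
    ]
    if dont_respawn:
        cmd.append("-R")      # do not respawn child processes
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders, not real neutron state paths.
    print(" ".join(build_keepalived_cmd(
        "/tmp/example/keepalived.conf",
        "/tmp/example/keepalived.pid",
        "/tmp/example/keepalived-vrrp.pid")))
```

With "-R" set, whatever supervises keepalived has to notice that the parent exited and start it again, which is the stop/start step described above.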

OpenStack Infra (hudson-openstack) wrote on 2016-07-15: Fix proposed to neutron (master)

#6

Fix proposed to branch: master
Review: https://review.openstack.org/342730

Changed in neutron:
status:	New → In Progress

John Schwarz (jschwarz) wrote on 2016-07-25:

#7

Please note that, as I understand it from the descriptions in this thread, once an update is required the keepalived process will restart (it will kill itself and then the l3 agent will need to restart it). This will, in turn, trigger the bug described in [1] and might cause an active/active configuration anyway.

[1]: https://bugs.launchpad.net/neutron/+bug/1597461

Dongcan Ye (hellochosen) wrote on 2016-07-25:

#8

@John, the thing is that only the vrrp child process restarts; the keepalived parent process stays alive.
We can simulate this problem by running kill -9 on the vrrp child process.

OpenStack Infra (hudson-openstack) on 2016-09-07

Changed in neutron:
assignee:	Dongcan Ye (hellochosen) → He Qing (tsinghe-7)

He Qing (tsinghe-7) wrote on 2016-09-07:

#9

The root cause is that the l3 agent sends the SIGHUP signal TWICE, which causes the VRRP process to be terminated. VIP addresses and routes are left over.
Keepalived restarts the VRRP process once it finds that it has terminated, and a re-election then starts between the VRRP peers. If the former master transitions to backup after the election, two active agents will be shown in Neutron.

Here is the strace result of keepalived when adding a new interface to router:

     1.001115 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)
     1.001102 select(1024, [4], [], [], {0, 934821}) = 0 (Timeout)
     0.935917 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)
     1.001114 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.123360 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33319, si_uid=0} ---
     0.000031 write(5, "\1\0\0\0", 4) = 4
     0.000048 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000041 read(4, "\1\0\0\0", 4) = 4
     0.000040 kill(33081, SIGHUP) = 0
     0.000052 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000046 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.003571 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33320, si_uid=0} ---
     0.000028 write(5, "\1\0\0\0", 4) = 4
     0.000049 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000042 read(4, "\1\0\0\0", 4) = 4
     0.000038 kill(33081, SIGHUP) = 0
     0.000035 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000057 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.032646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=33081, si_status=SIGHUP, si_utime=0, si_stime=0} ---
     0.000027 write(5, "\21\0\0\0", 4) = 4
     0.000044 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000039 read(4, "\21\0\0\0", 4) = 4
     0.000040 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGHUP}], WNOHANG, NULL) = 33081
     0.000057 wait4(-1, 0x7ffeb5a249a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
     0.000038 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000052 sendto(3, "<25>Sep 7 10:34:46 Keepalived[9"..., 80, MSG_NOSIGNAL, NULL, 0) = 80
     0.000046 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3661e90b10) = 33321
     0.000327 sendto(3, "<30>Sep 7 10:34:46 Keepalived[9"..., 76, MSG_NOSIGNAL, NULL, 0) = 76
     0.000049 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)

I started a patch here:
https://review.openstack.org/#/c/366493/
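
To make the strace output above easier to read, the following is a rough sketch of a parser that extracts the signal-delivery lines and summarizes them. It is not part of the patch; the function names are made up for this example, and the sample input is just the three signal lines copied from the trace.

```python
import re
from typing import Dict, List

# Matches strace signal lines such as:
#   --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33319, si_uid=0} ---
SIGNAL_RE = re.compile(r"--- (SIG[A-Z]+) \{(.*)\} ---")


def parse_signal_events(strace_text: str) -> List[Dict[str, str]]:
    """Return one dict of siginfo fields per signal line in the trace."""
    events = []
    for line in strace_text.splitlines():
        match = SIGNAL_RE.search(line)
        if not match:
            continue
        fields = dict(
            item.split("=", 1)
            for item in match.group(2).split(", ")
            if "=" in item
        )
        fields["signal"] = match.group(1)
        events.append(fields)
    return events


def summarize(events: List[Dict[str, str]]) -> Dict[str, object]:
    """Count received SIGHUPs and list children that were killed by a signal."""
    sighups = [e for e in events if e["signal"] == "SIGHUP"]
    killed = [e for e in events
              if e["signal"] == "SIGCHLD" and e.get("si_code") == "CLD_KILLED"]
    return {
        "sighup_count": len(sighups),
        "sighup_senders": [e["si_pid"] for e in sighups],
        "children_killed_by": [(e["si_pid"], e["si_status"]) for e in killed],
    }


SAMPLE = """\
     0.123360 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33319, si_uid=0} ---
     0.003571 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33320, si_uid=0} ---
     0.032646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=33081, si_status=SIGHUP, si_utime=0, si_stime=0} ---
"""

if __name__ == "__main__":
    print(summarize(parse_signal_events(SAMPLE)))
```

On the sample lines this reports two SIGHUPs from two different sender PIDs (33319 and 33320), followed by child 33081 being killed with SIGHUP, which matches the kill(33081, SIGHUP) calls and the wait4() result in the trace.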

John Schwarz (jschwarz) on 2016-09-07

Changed in neutron:
importance:	Undecided → Medium

OpenStack Infra (hudson-openstack) wrote on 2016-09-09: Fix merged to neutron (master)

#10

Reviewed: https://review.openstack.org/366493
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2b148c3f9299642e0bb068983de68ec6441a23be
Submitter: Jenkins
Branch: master

commit 2b148c3f9299642e0bb068983de68ec6441a23be
Author: He Qing <email address hidden>
Date: Wed Sep 7 05:07:25 2016 +0000

    Fix wrong HA router state

    When we add/remove router interface from HA router, l3 agent
    will send SIGHUP signal to keepalived for reloading configuration.

    But for DVR+HA router, l3 agent will send SIGHUP signal TWICE which
    will cause VRRP sub-process terminated and vip addresses and routes
    left over. Keepalived then restart VRRP process and there will be
    a re-election between VRRP peers. After the election, if the former
    is still master, the state showed from Neutron will be correct. But
    if the former master transitioned to backup, the new VRRP process
    will NOT delete vips and routes because it is not the one who
    configured them. There will be two active agent showed from Neutron.

    HaRouter.enable_keepalived() will send SIGHUP signal to keepalived.
    DvrEdgeHaRouter.process() should not call enable_keepalived() by
    itself because it has inherited from class HaRouter.

    Closes-Bug: 1602320
    Change-Id: I647269665a22b4becb3e326e1f4b03ddd961d6b1

Changed in neutron:
status:	In Progress → Fix Released
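
The double SIGHUP described in the commit message above can be illustrated with a heavily simplified sketch. This is not the neutron code: FakeKeepalived, BuggyDvrEdgeHaRouter and FixedDvrEdgeHaRouter are stand-in names invented for this example, and the real classes have a more involved inheritance chain. It only assumes what the commit message states, namely that the HaRouter processing path already ends up calling enable_keepalived().

```python
class FakeKeepalived:
    """Stand-in for the keepalived manager; counts reload requests."""

    def __init__(self):
        self.sighup_count = 0

    def reload(self):
        # In the real agent this would send SIGHUP to keepalived.
        self.sighup_count += 1


class HaRouter:
    def __init__(self, keepalived):
        self.keepalived = keepalived

    def enable_keepalived(self):
        self.keepalived.reload()

    def process(self):
        # The HA processing path already reloads keepalived once.
        self.enable_keepalived()


class BuggyDvrEdgeHaRouter(HaRouter):
    def process(self):
        super().process()
        # Redundant call: the inherited process() has already done this,
        # so keepalived receives a second SIGHUP.
        self.enable_keepalived()


class FixedDvrEdgeHaRouter(HaRouter):
    def process(self):
        # Rely on the inherited behaviour; one SIGHUP per update.
        super().process()


def count_reloads(router_cls) -> int:
    """Process one router update and return how many reloads were sent."""
    keepalived = FakeKeepalived()
    router_cls(keepalived).process()
    return keepalived.sighup_count


if __name__ == "__main__":
    print("buggy:", count_reloads(BuggyDvrEdgeHaRouter))
    print("fixed:", count_reloads(FixedDvrEdgeHaRouter))
```

In this model the buggy subclass sends two reloads per update and the fixed one sends a single reload, which is the behaviour change the commit message describes.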

OpenStack Infra (hudson-openstack) wrote on 2016-09-09: Fix proposed to neutron (stable/mitaka)

#11

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/367960

OpenStack Infra (hudson-openstack) wrote on 2016-09-12: Fix merged to neutron (stable/mitaka)

#12

Reviewed: https://review.openstack.org/367960
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fdaef357ea7357e67c74e4fc80608347496c3d7e
Submitter: Jenkins
Branch: stable/mitaka

commit fdaef357ea7357e67c74e4fc80608347496c3d7e
Author: He Qing <email address hidden>
Date: Wed Sep 7 05:07:25 2016 +0000

    Fix wrong HA router state

    When we add/remove router interface from HA router, l3 agent
    will send SIGHUP signal to keepalived for reloading configuration.

    But for DVR+HA router, l3 agent will send SIGHUP signal TWICE which
    will cause VRRP sub-process terminated and vip addresses and routes
    left over. Keepalived then restart VRRP process and there will be
    a re-election between VRRP peers. After the election, if the former
    is still master, the state showed from Neutron will be correct. But
    if the former master transitioned to backup, the new VRRP process
    will NOT delete vips and routes because it is not the one who
    configured them. There will be two active agent showed from Neutron.

    HaRouter.enable_keepalived() will send SIGHUP signal to keepalived.
    DvrEdgeHaRouter.process() should not call enable_keepalived() by
    itself because it has inherited from class HaRouter.

    Closes-Bug: 1602320
    Change-Id: I647269665a22b4becb3e326e1f4b03ddd961d6b1
    (cherry picked from commit 2b148c3f9299642e0bb068983de68ec6441a23be)

tags:

added: in-stable-mitaka

OpenStack Infra (hudson-openstack) wrote on 2016-09-22: Change abandoned on neutron (master)

#13

Change abandoned by Dongcan Ye (<email address hidden>) on branch: master
Review: https://review.openstack.org/342730
Reason: Fixed in https://review.openstack.org/#/c/366493/

OpenStack Infra (hudson-openstack) wrote on 2016-09-26: Fix included in openstack/neutron 9.0.0.0rc1

#14

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote on 2016-10-11: Fix included in openstack/neutron 8.3.0

#15

This issue was fixed in the openstack/neutron 8.3.0 release.

OpenStack Infra (hudson-openstack) wrote on 2016-10-18: Fix included in openstack/neutron 9.0.0.0rc1

#16

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote on 2016-11-10: Fix included in openstack/neutron 8.3.0

#17

This issue was fixed in the openstack/neutron 8.3.0 release.

neutron

ha + distributed router: keepalived process kill vrrp child process