Comment 9 for bug 1602320

He Qing (tsinghe-7) wrote :

The root cause is that the l3 agent sends the SIGHUP signal TWICE, which causes the VRRP process to be terminated. The VIP addresses and routes are left behind.
Keepalived restarts the VRRP process once it finds that the process has terminated, which starts a re-election between the VRRP peers. If the former master transitions to backup after the election, Neutron will show two active agents.
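
One way to avoid delivering two reloads back to back is to coalesce them on the sender side. The sketch below is only an illustration under assumptions: `ReloadCoalescer`, `min_interval`, and the injectable `send` and `clock` arguments are hypothetical names, not Neutron or keepalived APIs, and this is not necessarily what the patch linked at the end of this comment does.

```python
import os
import signal
import time


class ReloadCoalescer:
    """Send at most one SIGHUP per `min_interval` seconds to a given pid.

    Hypothetical helper for illustration; not part of Neutron or keepalived.
    """

    def __init__(self, min_interval=1.0, send=os.kill, clock=time.monotonic):
        self.min_interval = min_interval  # assumed value, tune as needed
        self._send = send                 # injectable so it can be tested
        self._clock = clock
        self._last_sent = {}              # pid -> time of the last SIGHUP

    def reload(self, pid):
        """Send SIGHUP unless one was sent to `pid` very recently.

        Returns True if the signal was sent, False if it was suppressed.
        """
        now = self._clock()
        last = self._last_sent.get(pid)
        if last is not None and now - last < self.min_interval:
            return False
        self._send(pid, signal.SIGHUP)
        self._last_sent[pid] = now
        return True
```

Note that simply dropping the second request could lose a configuration change written between the two calls; a real fix would more likely defer the second reload or avoid issuing it in the first place.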

Here is the strace output of keepalived when adding a new interface to a router:

     1.001115 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)
     1.001102 select(1024, [4], [], [], {0, 934821}) = 0 (Timeout)
     0.935917 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)
     1.001114 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.123360 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33319, si_uid=0} ---
     0.000031 write(5, "\1\0\0\0", 4) = 4
     0.000048 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000041 read(4, "\1\0\0\0", 4) = 4
     0.000040 kill(33081, SIGHUP) = 0
     0.000052 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000046 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.003571 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33320, si_uid=0} ---
     0.000028 write(5, "\1\0\0\0", 4) = 4
     0.000049 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000042 read(4, "\1\0\0\0", 4) = 4
     0.000038 kill(33081, SIGHUP) = 0
     0.000035 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000057 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.032646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=33081, si_status=SIGHUP, si_utime=0, si_stime=0} ---
     0.000027 write(5, "\21\0\0\0", 4) = 4
     0.000044 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000039 read(4, "\21\0\0\0", 4) = 4
     0.000040 wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGHUP}], WNOHANG, NULL) = 33081
     0.000057 wait4(-1, 0x7ffeb5a249a4, WNOHANG, NULL) = -1 ECHILD (No child processes)
     0.000038 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000052 sendto(3, "<25>Sep 7 10:34:46 Keepalived[9"..., 80, MSG_NOSIGNAL, NULL, 0) = 80
     0.000046 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3661e90b10) = 33321
     0.000327 sendto(3, "<30>Sep 7 10:34:46 Keepalived[9"..., 76, MSG_NOSIGNAL, NULL, 0) = 76
     0.000049 select(1024, [4], [], [], {1, 0}) = 0 (Timeout)
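
The trace above can also be checked mechanically. This is a minimal sketch, assuming the leading number on each line is a relative timestamp as printed by `strace -r` (time elapsed since the previous line); `parse_signals` and `SAMPLE` are hypothetical names, and `SAMPLE` is an excerpt copied from the trace.

```python
import re

# "<delta> <rest of line>", where delta is assumed to be the strace -r timestamp.
LINE_RE = re.compile(r"^(?P<delta>\d+\.\d+)\s+(?P<rest>.*)$")
# Signal delivery lines look like: --- SIGHUP {si_signo=SIGHUP, ...} ---
SIGNAL_RE = re.compile(r"^--- (?P<sig>SIG\w+) \{(?P<info>.*)\} ---$")

SAMPLE = r"""
     0.123360 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33319, si_uid=0} ---
     0.000031 write(5, "\1\0\0\0", 4) = 4
     0.000048 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000041 read(4, "\1\0\0\0", 4) = 4
     0.000040 kill(33081, SIGHUP) = 0
     0.000052 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000046 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.003571 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=33320, si_uid=0} ---
     0.000028 write(5, "\1\0\0\0", 4) = 4
     0.000049 rt_sigreturn() = -1 EINTR (Interrupted system call)
     0.000042 read(4, "\1\0\0\0", 4) = 4
     0.000038 kill(33081, SIGHUP) = 0
     0.000035 read(4, 0x7ffeb5a249d4, 4) = -1 EAGAIN (Resource temporarily unavailable)
     0.000057 select(1024, [4], [], [], {1, 0}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.032646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=33081, si_status=SIGHUP, si_utime=0, si_stime=0} ---
"""


def parse_signals(lines):
    """Return (offset_seconds, signal_name, siginfo_dict) per signal line.

    offset_seconds is the running sum of the per-line deltas, so the
    difference between two offsets is the time between those two lines.
    """
    events = []
    offset = 0.0
    for raw in lines:
        line = LINE_RE.match(raw.strip())
        if not line:
            continue  # blank or unrecognized line
        offset += float(line.group("delta"))
        sig = SIGNAL_RE.match(line.group("rest"))
        if sig:
            fields = sig.group("info").split(", ")
            info = dict(field.split("=", 1) for field in fields)
            events.append((offset, sig.group("sig"), info))
    return events


if __name__ == "__main__":
    for offset, name, info in parse_signals(SAMPLE.splitlines()):
        print(f"{offset:.6f} {name} from pid {info['si_pid']} ({info['si_code']})")
```

On this excerpt the parser finds two SIGHUPs from different sender pids (33319 and 33320), about 3.8 ms apart under the timestamp assumption, followed by a SIGCHLD with si_code=CLD_KILLED and si_status=SIGHUP for child 33081.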

I started a patch here:
https://review.openstack.org/#/c/366493/