keepalived using genhash keeps getting "zombied" and using 100% CPU

Bug #160426 reported by Adrian Almenar on 2007-11-06
Affects: keepalived (Ubuntu)
Importance: Medium
Assigned to: Andres Rodriguez

Bug Description

Binary package hint: keepalived

When using the latest keepalived version from Dapper, a genhash failure on the HTTP service check makes keepalived write to the logs every second, filling the hard drive and using 100% CPU, and the checker becomes a "zombie". Restarting the process does not help, and if it happens on the master server, it also starts on the slave server, making the service unavailable for 90% of requests. I compiled and installed 1.1.15 by hand and the problem has gone away for now; with 1.1.14 the issue still happens.

Log Trace:
Nov 4 06:43:41 fw02 Keepalived: Starting Healthcheck child process, pid=9787
Nov 4 06:43:41 fw02 Keepalived_healthcheckers: Registering Kernel netlink reflector
Nov 4 06:43:41 fw02 Keepalived: VRRP child process(9786) died: Respawning
Nov 4 06:43:41 fw02 Keepalived_healthcheckers: Registering Kernel netlink command channel
Nov 4 06:43:41 fw02 Keepalived: Remove a zombie pid file /var/run/vrrp.pid
Nov 4 06:43:41 fw02 Keepalived_healthcheckers: Configuration is using : 31701 Bytes
Nov 4 06:43:41 fw02 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Nov 4 06:43:41 fw02 Keepalived: Starting VRRP child process, pid=9788
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering Kernel netlink reflector
Nov 4 06:43:42 fw02 Keepalived: Healthcheck child process(9787) died: Respawning
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering Kernel netlink command channel
Nov 4 06:43:42 fw02 Keepalived: Remove a zombie pid file /var/run/checkers.pid
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering gratutious ARP shared channel
Nov 4 06:43:42 fw02 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
Nov 4 06:43:42 fw02 Keepalived: Starting Healthcheck child process, pid=9789
Nov 4 06:43:42 fw02 Keepalived_healthcheckers: Registering Kernel netlink reflector
Nov 4 06:43:42 fw02 Keepalived_healthcheckers: Registering Kernel netlink command channel
Nov 4 06:43:42 fw02 Keepalived: VRRP child process(9788) died: Respawning
Nov 4 06:43:42 fw02 Keepalived_healthcheckers: Configuration is using : 31701 Bytes
Nov 4 06:43:42 fw02 Keepalived: Remove a zombie pid file /var/run/vrrp.pid
Nov 4 06:43:42 fw02 Keepalived_vrrp: Using MII-BMSR NIC polling thread...
Nov 4 06:43:42 fw02 Keepalived: Starting VRRP child process, pid=9790
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering Kernel netlink reflector
Nov 4 06:43:42 fw02 Keepalived: Healthcheck child process(9789) died: Respawning
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering Kernel netlink command channel
Nov 4 06:43:42 fw02 Keepalived: Remove a zombie pid file /var/run/checkers.pid
Nov 4 06:43:42 fw02 Keepalived_vrrp: Registering gratutious ARP shared channel
Nov 4 06:43:42 fw02 Keepalived_healthcheckers: Using MII-BMSR NIC polling thread...
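
The respawn loop in the trace above can be spotted mechanically. The following is a minimal sketch, not part of keepalived: it assumes syslog lines in the format shown in this report, and the function names and the `threshold` parameter are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Matches parent-process lines such as:
#   "Nov 4 06:43:42 fw02 Keepalived: Healthcheck child process(9787) died: Respawning"
RESPAWN_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+Keepalived:\s+"
    r"(?P<child>VRRP|Healthcheck) child process\((?P<pid>\d+)\) died: Respawning"
)


def count_respawns(lines):
    """Count respawn events, keyed by (timestamp, child type)."""
    counts = Counter()
    for line in lines:
        match = RESPAWN_RE.match(line.strip())
        if match:
            counts[(match.group("stamp"), match.group("child"))] += 1
    return counts


def looks_like_respawn_loop(lines, threshold=2):
    """Heuristic: True if one child type died `threshold` or more times
    within the same one-second timestamp. The threshold is an assumption."""
    return any(n >= threshold for n in count_respawns(lines).values())


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python respawn_check.py < /var/log/syslog
    print(looks_like_respawn_loop(sys.stdin))
```

A healthy daemon should log no "died: Respawning" lines at all, so even a low threshold is a reasonable alarm for this symptom.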

Adrian Almenar (aalmenar) wrote :

With 1.1.13 from Debian unstable it works OK; this does not happen.

Koen Beek (koen-beek) wrote :

Hardy was synced with 1.1.15 on 9-1-08.

Adam Logghe (alogghe) wrote :

I'm having this same problem with the latest 1.1.15 package in Hardy...

Andres Rodriguez (andreserl) wrote :

Do you still have this error? Have you tried it in Jaunty? Could you provide steps to reproduce it and a sample config?

Changed in keepalived (Ubuntu):
assignee: nobody → Ubuntu High Availability Team (ubuntu-ha)
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in keepalived (Ubuntu):
status: Incomplete → Invalid
Changed in keepalived (Ubuntu):
status: Invalid → New
Andrey Oleynik (andrey-oleynik) wrote :

I have the same problem.

Keepalived v1.1.17 (02/10,2010)

Config:
****
virtual_server localhost 8443 {

    delay_loop 5
    protocol TCP

    real_server localhost 8443 {

        notify_up "logger -p daemon.warn -t Keepalived 'Proxy UP';"
        notify_down "logger -p daemon.err -t Keepalived 'Proxy DOWN, restarting...'; service vrrp restart; service nginx restart"

        SSL_GET {
            url {
              path /stub
              status_code 200
            }
            connect_port 8443
            connect_timeout 3
            delay_before_retry 3
        }
    }

    real_server localhost 8443 {

        notify_up "logger -p daemon.warn -t Keepalived 'VRRPd UP';"
        notify_down "logger -p daemon.err -t Keepalived 'VRRPd DOWN, restarting...'; service vrrp restart"

****

The problem is reproduced after rebooting the computer. It is writing the following messages to log:
May 31 18:24:36 int-top-lb2 Keepalived: Remove a zombie pid file /var/run/checkers.pid
May 31 18:24:36 int-top-lb2 Keepalived: Starting Healthcheck child process, pid=3738
May 31 18:24:36 int-top-lb2 Keepalived: Healthcheck child process(3738) died: Respawning
May 31 18:24:36 int-top-lb2 Keepalived: Remove a zombie pid file /var/run/checkers.pid
May 31 18:24:36 int-top-lb2 Keepalived: Starting Healthcheck child process, pid=3740
May 31 18:24:36 int-top-lb2 Keepalived: Healthcheck child process(3740) died: Respawning
May 31 18:24:36 int-top-lb2 Keepalived: Remove a zombie pid file /var/run/checkers.pid
May 31 18:24:36 int-top-lb2 Keepalived: Starting Healthcheck child process, pid=3742
and so on

As we have found, after running "modprobe ip_vs" keepalived returns to normal operation.
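
Given that loading ip_vs cleared the loop, it may help to check whether the module is present before keepalived starts. Below is a small sketch under stated assumptions: it reads /proc/modules, where the module name is the first field of each line, and the helper names are hypothetical. A kernel with IPVS built in rather than built as a module would not appear in that file, so a False result is only a hint.

```python
def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field of any /proc/modules line.

    An exact match is required, so "ip_vs_rr" does not count as "ip_vs".
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def ip_vs_loaded(path="/proc/modules"):
    """Check the running kernel's module list for ip_vs."""
    with open(path) as handle:
        return module_loaded("ip_vs", handle.read())


if __name__ == "__main__":
    if ip_vs_loaded():
        print("ip_vs is loaded")
    else:
        print("ip_vs is not listed; try 'modprobe ip_vs' before starting keepalived")
```

On Debian and Ubuntu systems, listing ip_vs in /etc/modules is one way to have it loaded at boot; whether that fully avoids the respawn loop reported here has not been confirmed in this thread.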

Andres Rodriguez (andreserl) wrote :

Hi there,

Does this problem only happen right after a reboot? IIRC, this might be related to when keepalived is started relative to networking.

Is it up to upstart/sysvinit to start the daemon, and then this happens?
Are you obtaining an IP address with DHCP or is it statically configured?

Could you try to stop and restart the service and see if the issue is seen again?

Thank you in advance!

Regards

Changed in keepalived (Ubuntu):
assignee: Ubuntu High Availability Team (ubuntu-ha) → Andres Rodriguez (andreserl)
Andrey Oleynik (andrey-oleynik) wrote :

Yes, this problem happens right after a reboot, and if I try to restart the service the result is the same.

I'm obtaining an IP dynamically via VRRP. The machine has one statically configured IP and additionally gets a second one if it becomes the master router.

Clint Byrum (clint-fewbar) wrote :

Setting Importance to Medium. 5 year old bug that has only one reporter, so the scope of the impact seems fairly limited.

Changed in keepalived (Ubuntu):
importance: Undecided → Medium
Adam Logghe (alogghe) wrote :

There is clearly more than one reporter (I'm one of them).

I worked around it by compiling from source.

Keepalived, IF you are using it, is likely to be a HIGHLY critical service (I am not using it currently but will again).

Maybe it's still a Medium, but I just wanted to offer a counterpoint to your comment, Clint.

Clint Byrum (clint-fewbar) wrote :

Adam, thanks for the response! With that, I'll mark it as Confirmed.

The symptoms all seem to center around the order that modules are loaded and perhaps how networking is handled. Perhaps keepalived needs to be delayed until after all networking. I have to wonder if this still affects 12.04, since runlevel 2 is now delayed until all interfaces in /etc/network/interfaces have been brought up.

Changed in keepalived (Ubuntu):
status: New → Confirmed