rootwrap sudo process goes into defunct state

Bug #1841682 reported by Colby Walsworth
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

OS: CentOS 7.6
OpenStack version: Rocky (13.0.4)

Seeing many of these defunct processes:

neutron 83749 1 0 Aug26 ? 00:00:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=bcd58bbf-8711-46c0-b7f5-252f448febe9 --namespace=qrouter-bcd58bbf-8711-46c0-b7f5-252f448febe9 --conf_dir=/var/lib/neutron/ha_confs/bcd58bbf-8711-46c0-b7f5-252f448febe9 --monitor_interface=ha-45ff175b-0f --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/bcd58bbf-8711-46c0-b7f5-252f448febe9.monitor.pid --state_path=/var/lib/neutron --user=427 --group=225 --AGENT-root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf --AGENT-root_helper_daemon=sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root 83750 83749 0 Aug26 ? 00:00:00 ip -o monitor address
root 83807 83749 0 Aug26 ? 00:00:00 [sudo] <defunct>

This is an l3_ha router setup with 2 network nodes.

Tags: l3-ha
tags: added: l3-ha
LIU Yulong (dragon889)
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Lajos Katona (lajos-katona) wrote :

Hi, I also see this kind of defunct process without l3_ha, and on master.
AFAIK this is "normal" Linux behaviour: the parent of these processes disappeared, the children finished their tasks, and init will eventually reap them.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Yes, you are right Lajos, but this usually happens because the process was killed rather than because it finished. We should find out which process was killed, by whom, and how to terminate it properly.

Anyway, so far this bug is not a security issue or critical.

Changed in neutron:
importance: Undecided → Low
Colby Walsworth (colbywalsworth) wrote :

I'm not sure if it helps at all, but I only see the zombie processes on the backup l3_ha node. The master node does not have any. We currently have all routers on one node because the second node is being patched and rebooted.

John Haller (john-haller) wrote :

See https://opendev.org/openstack/oslo.rootwrap/commit/af8ad2da809f68442da9aacd17a47bca342eb355

I suspect this is a duplicate of, or at least related to, one of the bugs referenced in the above link: #1658973 #1658977 #1663458

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I don't know why this bug is assigned to me.

Just as a comment: since https://review.opendev.org/#/c/660611/, IP monitoring is done in a parallel thread using pyroute2. The shell command "ip -o monitor" is no longer used.

Regards.

Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody
Fabian Zimmermann (dev-faz) wrote :

Hi,

I am having the same issue. I can confirm it only applies to the agent hosting the router in backup state.

I did some debugging, and it looks like the zombies are created as soon as the rootwrap-daemon runs into the daemon timeout (default 600s).

This does not seem to affect the function of the l3-agent. As soon as a rootwrap action needs to be executed, it just starts a new rootwrap-daemon and executes the action.

It seems the problem goes away once the agent has performed at least one failover for this router; the timeout does not seem to be triggered afterwards.

I don't think it has to do with "ip -o monitor". It's more an issue with sudo and reaping processes.

Any ideas how to debug these sudo issues?

 Fabian

Fabian Zimmermann (dev-faz) wrote :

Another hint: it seems the zombie gets reaped as soon as another daemon is started. Maybe the code is just not collecting its children until another action is queued?
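
The behaviour described in this hint can be reproduced outside of neutron. The following is a minimal Python 3 sketch, not code from neutron or oslo.rootwrap: a child that has exited stays in state Z (defunct) until the parent collects its exit status, and a non-blocking poll() is enough to do that. The helper names (proc_state, wait_for_state, demo_poll_reaps) are made up for illustration, and the /proc parsing assumes Linux.

```python
import subprocess
import sys
import time


def proc_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if the entry is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            data = f.read()
    except OSError:
        return None
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=10.0):
    """Re-read /proc until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != wanted and time.monotonic() < deadline:
        time.sleep(0.05)
        state = proc_state(pid)
    return state


def demo_poll_reaps():
    # Hypothetical stand-in for the daemon: a child that exits immediately.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    before = wait_for_state(child.pid, "Z")  # exited, status not yet collected
    returncode = child.poll()                # non-blocking; collects the exit status
    after = proc_state(child.pid)            # /proc entry disappears once reaped
    return before, returncode, after
```

Calling demo_poll_reaps() returns the state before the poll, the child's return code, and the state afterwards, so the three values show the zombie appearing and then disappearing.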

Slawek Kaplonski (slaweq) wrote :

Hi Fabian,

Can you tell us which version of neutron you are using, and whether you already have https://review.opendev.org/#/c/660611/ or not?
If you don't have it, can you check whether it solves the problem for you?

Damon Li (damonl1) wrote :

This issue should be fixed by this patch: https://github.com/openstack/oslo.rootwrap/commit/c9a57aab082f55d525f003db61290b6ab7437b7c

In the oslo.rootwrap client, subprocess.Popen is used to start the neutron-rootwrap-daemon process. This process has a timeout (default 600s). If it is not called during this period, it is killed. At that point the client needs to call wait() on it; otherwise the process is left defunct.

Please check whether your oslo.rootwrap contains this patch.
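
As an illustration of the point about wait(), here is a small Python 3 sketch of the general pattern of stopping a child started with subprocess.Popen and then collecting its exit status. It is not the code from the patch above; the function name stop_and_reap and the 5 second grace period are assumptions made for the example.

```python
import subprocess


def stop_and_reap(proc, timeout=5.0):
    """Terminate a Popen child and always collect its exit status.

    Hypothetical helper: sends SIGTERM, waits up to `timeout` seconds,
    and falls back to SIGKILL so that wait() is guaranteed to return.
    """
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Without the wait() call the terminated child would stay in the process table as defunct for as long as the parent keeps running and does not collect it some other way.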

Brian Haley (brian-haley) wrote :

Marking this bug as fixed based on the above comment. If you have that fix and the problem still persists, please re-open.

Changed in neutron:
status: New → Fix Released