neutron

rootwrap sudo process goes into defunct state

Bug #1841682 reported by Colby Walsworth on 2019-08-27

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	Low	Unassigned

Bug Description

OS: CentOS 7.6
OpenStack version: Rocky (13.0.4)

Seeing many of these defunct processes:

neutron 83749 1 0 Aug26 ? 00:00:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=bcd58bbf-8711-46c0-b7f5-252f448febe9 --namespace=qrouter-bcd58bbf-8711-46c0-b7f5-252f448febe9 --conf_dir=/var/lib/neutron/ha_confs/bcd58bbf-8711-46c0-b7f5-252f448febe9 --monitor_interface=ha-45ff175b-0f --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/bcd58bbf-8711-46c0-b7f5-252f448febe9.monitor.pid --state_path=/var/lib/neutron --user=427 --group=225 --AGENT-root_helper=sudo neutron-rootwrap /etc/neutron/rootwrap.conf --AGENT-root_helper_daemon=sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root 83750 83749 0 Aug26 ? 00:00:00 ip -o monitor address
root 83807 83749 0 Aug26 ? 00:00:00 [sudo] <defunct>

This is an l3_ha router setup with 2 network nodes.
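
One way to list such defunct processes together with their parent PIDs is to scan /proc. This is a minimal sketch, assuming a Linux host; the helper names parse_stat and find_zombies are made up for illustration and are not part of neutron or oslo.rootwrap.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state, ppid) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    fields = stat_line[right + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, comm, ppid) for every process in state 'Z'."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, comm, state, ppid = parse_stat(stat_file.read())
        except (OSError, ValueError):
            continue  # the process went away while we were scanning
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} parent={ppid}")
```

The parent PID is the interesting part: a zombie stays in the process table until that parent calls wait on it or exits itself.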

Lajos Katona (lajos-katona) on 2019-08-28

tags:	added: l3-ha

LIU Yulong (dragon889) on 2019-08-28

Changed in neutron:
assignee:	nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)

Lajos Katona (lajos-katona) wrote on 2019-08-29:

Hi, I also see this kind of defunct process without l3_ha, and on master.
AFAIK this is "normal" Linux behaviour: the parent of these processes disappeared, the children finished their tasks, and init will eventually reap them.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2019-08-29:

Yes, you are right Lajos, but this usually happens because the process was killed rather than finishing normally. We should find out which process was killed, by whom, and how to terminate it properly.

Anyway, so far this bug is neither a security issue nor critical.

Lajos Katona (lajos-katona) on 2019-09-02

Changed in neutron:
importance:	Undecided → Low

Colby Walsworth (colbywalsworth) wrote on 2019-09-03:

I'm not sure if it helps at all, but I only see the zombie processes on the backup l3_ha node. The master node does not have any. We currently have all routers on one node because the second node is being patched and rebooted.

John Haller (john-haller) wrote on 2020-03-09:

See https://opendev.org/openstack/oslo.rootwrap/commit/af8ad2da809f68442da9aacd17a47bca342eb355

I suspect this is a duplicate of, or at least related to, one of the bugs referenced in the link above: #1658973 #1658977 #1663458

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2020-03-09:

Hello:

I don't know why this bug is assigned to me.

Just as a comment: since https://review.opendev.org/#/c/660611/, the IP monitoring is done in a parallel thread using Pyroute2. The shell command "ip -o monitor" is not used anymore.

Regards.

Changed in neutron:
assignee:	Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody

Fabian Zimmermann (dev-faz) wrote on 2020-07-15:

Hi,

I'm having the same issue and can confirm it only applies to the agent hosting the backup state.

I did some debugging, and it looks like the zombies are created as soon as the rootwrap-daemon runs into the daemon timeout (default 600s).

This does not seem to affect the functioning of the l3-agent. As soon as a rootwrap action needs to be executed, it just starts a new rootwrap-daemon and executes the action.
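
To check which idle timeout a given node actually uses, the value can be read from the rootwrap configuration. The sketch below assumes that the option is named daemon_timeout and lives in the [DEFAULT] section of the file passed on the command line above (/etc/neutron/rootwrap.conf); the fallback of 600 simply mirrors the default quoted in this comment. The function name is hypothetical.

```python
import configparser


def read_daemon_timeout(conf_text, default=600):
    """Return the daemon_timeout value from rootwrap.conf contents.

    `default` is an assumption taken from the 600s mentioned in this
    thread; it is used when the option is not set in the file.
    """
    parser = configparser.RawConfigParser()
    parser.read_string(conf_text)
    return parser.getint("DEFAULT", "daemon_timeout", fallback=default)


if __name__ == "__main__":
    # Path taken from the process listing in the bug description.
    with open("/etc/neutron/rootwrap.conf") as conf_file:
        print(read_daemon_timeout(conf_file.read()))
```

If the zombies appear roughly this many seconds after the last rootwrap call on the backup node, that would support the timeout theory.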

It seems the problem goes away after the agent has done at least one failover for this router; it seems the timeout is not triggered afterwards.

I don't think it has to do with "ip -o monitor". It's more an issue with sudo and reaping processes.

Any ideas how to debug these sudo issues?

Fabian

Fabian Zimmermann (dev-faz) wrote on 2020-07-15:

Another hint: it seems the zombie gets reaped as soon as another daemon is started. Maybe the code just does not reap its children until another action is queued?
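
For illustration, this is roughly what "fetching its children" means at the OS level: the parent has to call waitpid for each exited child. This is a generic POSIX sketch, not the oslo.rootwrap code; the function name reap_exited_children is hypothetical.

```python
import os


def reap_exited_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, raw_status) tuples for the reaped children.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none of them has exited yet
        reaped.append((pid, status))
    return reaped
```

A parent that only runs something like this when it is about to start a new child would show exactly the pattern described here: the zombie lingers while the parent is idle and disappears on the next action. Note that reaping with pid -1 also consumes the exit status of children owned by subprocess.Popen objects, so it is not a drop-in fix.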

Slawek Kaplonski (slaweq) wrote on 2020-07-27:

Hi Fabian,

Can you tell us which version of neutron you are using, and whether you already have https://review.opendev.org/#/c/660611/ or not?
If you don't have it, can you check whether it solves the problem for you?

Damon Li (damonl1) wrote on 2020-11-12:

This issue should be fixed by this patch: https://github.com/openstack/oslo.rootwrap/commit/c9a57aab082f55d525f003db61290b6ab7437b7c

In the oslo.rootwrap client, we use subprocess.Popen to start the neutron-rootwrap-daemon process. This process has a timeout (default 600s); if it is not called during this period, it is killed. At that point the client needs to call wait on it, otherwise it is left behind as a defunct process.
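
The effect of the missing wait can be reproduced outside of neutron. The following is a minimal sketch, assuming Linux (it reads /proc) and using a short-lived Python child as a stand-in for the rootwrap daemon; proc_state and demo are illustrative names, not oslo.rootwrap functions.

```python
import subprocess
import sys
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as stat_file:
            data = stat_file.read()
    except OSError:
        return None
    # The state is the first field after the parenthesised command name.
    return data[data.rindex(")") + 1:].split()[0]


def demo():
    """Start a child, let it exit unreaped, then reap it with wait()."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    # Give the child time to exit, but do NOT call wait()/poll() yet.
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        if proc_state(child.pid) in ("Z", None):
            break
        time.sleep(0.05)
    before = proc_state(child.pid)  # state while nobody has waited on it
    child.wait()                    # reaps the child
    after = proc_state(child.pid)   # state after the parent has waited
    return before, after


if __name__ == "__main__":
    print(demo())
```

On a Linux host with default SIGCHLD handling, the first value is expected to be "Z" (the defunct entry seen in ps) and the second None, because wait() removes the entry from the process table.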

Please check if your oslo.rootwrap contains this patch or not.

Brian Haley (brian-haley) wrote on 2023-01-10:

Marking this bug as fixed based on the above comment. If you have that fix and the problem still persists, please re-open.

Changed in neutron:
status:	New → Fix Released
