Neutron causes systemd to hang on Linux guests with SELinux disabled
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Expired | Undecided | Unassigned |
Bug Description
We have observed an issue, present at least in the Ussuri release, where a Linux guest VM's systemd process ends up hung and consuming 100% CPU if SELinux is disabled or set to permissive.
As of now we have only verified this issue with CentOS 7 and 8 guests, as those are the only Linux distributions we use.
We believe we have tracked the issue to something in Neutron, possibly more specifically to the way remote security group rules are processed and/or to an issue with inter-VM communication when SELinux is disabled in the guest.
We have observed the same behavior on multiple deployments, whether an all-in-one deployment with LVM-backed cinder volumes or a multinode deployment with a Ceph backend. What we have learned / observed so far is the following:
If SELinux is disabled/permissive in the guest VM AND the "default" security group contains the rule created by "openstack security group rule create --remote-group default default", the systemd process spikes to 100% CPU usage (all cores) in the guest VM, and a reboot is required to clear the issue. The issue recurs several hours after reboot. The problem exists whether or not other VMs exist on the network and/or traffic is being passed; simply having the ability to pass traffic causes the issue. In the test scenario where this was first discovered and which led us to dig into the cause, network performance between VMs with this configuration was poor, with high latency between the VMs on the network. It was a web server / MySQL server pair: queries had 10-second runtimes when called from the web server but executed in milliseconds when run directly on the MySQL server.
If the rule for "openstack security group rule create --remote-group default default" is removed from the server, the problem does not recur after a reboot.
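As a sketch of the workaround described above, the offending rule can be located and removed with the OpenStack CLI (the rule ID shown is a placeholder, not from this report):

```shell
# List the rules in the "default" security group; the problematic one
# has "default" as its remote security group rather than a CIDR.
openstack security group rule list default --long

# Delete the remote-group rule by its ID (placeholder shown here).
openstack security group rule delete <rule-id>
```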
Likewise, if SELinux is enabled in the guest, everything works fine.
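For reference, the guest's SELinux mode can be inspected and toggled with the standard CentOS tooling (these are stock commands, not specific to this report):

```shell
# Print the current mode: Enforcing, Permissive, or Disabled.
getenforce

# Switch to enforcing mode until the next boot (requires root).
sudo setenforce 1

# A persistent change is made by setting SELINUX=enforcing
# in /etc/selinux/config and rebooting.
```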
We also ran strace on the systemd process in the guest VMs while the CPUs were pegged; all VMs exhibiting this behavior appeared to be stuck in a perpetual "wait" state. strace output from a seemingly hung systemd process:
epoll_pwait(4, [], 1024, 196, NULL, 8) = 0
epoll_pwait(4, [], 1024, 443, NULL, 8) = 0
epoll_pwait(4, [], 1024, 49, NULL, 8) = 0
epoll_pwait(4, [], 1024, 500, NULL, 8) = 0
epoll_pwait(4, [], 1024, 447, NULL, 8) = 0
epoll_pwait(4, [], 1024, 52, NULL, 8) = 0
This repeats over and over, and every once in a while in the middle of it we see what appears to be a JSON request:
read(9, "\1\0\0\0\0\0\0\0", 1024) = 8
write(13, "{\"id\
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=13, u64=13}}) = 0
epoll_pwait(4, [{EPOLLIN, {u32=13, u64=13}}], 1024, 335, NULL, 8) = 1
read(13, "{\"jsonrpc\
futex(0xa153e4, FUTEX_WAKE_
epoll_pwait(4, [{EPOLLIN, {u32=9, u64=9}}], 1024, 135, NULL, 8) = 1
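For anyone trying to reproduce this, a trace like the one above can be gathered by attaching to the guest's systemd (PID 1); a minimal invocation would be:

```shell
# Attach to PID 1, follow forked children, and add microsecond
# timestamps (-tt) so the epoll_pwait timeouts can be correlated
# with wall-clock time; write the trace to a file.
sudo strace -f -tt -p 1 -o /tmp/systemd.strace
```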
All setups on which we are observing this behavior were deployed with kolla-ansible and are running the latest Ussuri release, with KVM as the hypervisor on all of them.
So this is happening inside the guest and not on the hypervisor?
I guess I could see how some security group setting in Neutron could cause issues with certain packets, but that would be more the trigger; it would (IMO) still be a bug in systemd to go into a loop in this scenario. If it is packet-related, you should be able to reproduce the same issue without Neutron by inserting some iptables rules on the guest to drop certain packets.
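A minimal sketch of that suggestion, run inside the guest (the matched port and drop probability are arbitrary examples, not known triggers):

```shell
# Drop a subset of inbound traffic to mimic what a restrictive
# security group might do; adjust the match as needed.
sudo iptables -A INPUT -p tcp --dport 3306 -j DROP

# Or drop inbound packets probabilistically to simulate lossy filtering.
sudo iptables -A INPUT -m statistic --mode random --probability 0.2 -j DROP

# Flush the INPUT chain to remove the test rules when done.
sudo iptables -F INPUT
```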
I would think reproducing on another distro besides CentOS would be a good first step; both Ubuntu and Debian cloud images should be readily available.
The command "openstack security group rule create --remote-group default default" could cause the underlying hypervisor to become very busy if the related set of IP addresses is quite large, as it calculates what rules to insert (neutron-ovs-agent), but it should be a short-lived spike in CPU usage.
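One way to check for the short-lived spike described above (the container name assumes a kolla-ansible deployment, as used in this report):

```shell
# On the compute node, watch the OVS agent's CPU usage while the
# remote-group rule is applied; a brief spike is expected, a
# sustained one is not.
docker stats neutron_openvswitch_agent --no-stream
```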