nova-compute leaks green threads

Bug #1971760 reported by Mohammed Naser
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

At the moment, if the cloud sustains a large number of VIF plugging timeouts, it leads to a large number of leaked green threads, which can cause the nova-compute process to stop reporting/responding.

The tracebacks that occur look like this:

===
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] Traceback (most recent call last):
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7230, in _create_guest_with_network
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] guest = self._create_guest(
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] next(self.gen)
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] actual_event = event.wait()
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/event.py", line 125, in wait
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] result = hub.switch()
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] return self.greenlet.switch()
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] eventlet.timeout.Timeout: 300 seconds
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b]
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] During handling of the above exception, another exception occurred:
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b]
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] Traceback (most recent call last):
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 2409, in _build_and_run_instance
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] self.driver.spawn(context, instance, image_meta,
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 4193, in spawn
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] self._create_guest_with_network(
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7256, in _create_guest_with_network
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] raise exception.VirtualInterfaceCreateException()
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b] nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-04-17 00:21:28.651 877893 ERROR nova.compute.manager [instance: 0c0d2422-781c-4bd2-b6bd-e5e3c94b602b]
===
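
For context, the code path in the traceback boils down to waiting on an eventlet Event under an eventlet.Timeout (the 300 seconds corresponds to nova's vif_plugging_timeout default). Here is a condensed, standalone sketch of that pattern, with simplified names; this is not nova's actual code:

```python
# Condensed sketch of the wait-under-timeout pattern from the traceback
# above. Names are simplified; this is not nova's actual code.
import eventlet
from eventlet import event


def wait_for_vif_plugged(timeout=300):
    ev = event.Event()  # in nova, completed when the neutron event arrives
    with eventlet.Timeout(timeout):
        return ev.wait()  # parks the greenlet in hub.switch(), as above


try:
    wait_for_vif_plugged(timeout=1)  # shortened so the example finishes fast
except eventlet.Timeout:
    # nova translates this into VirtualInterfaceCreateException
    print("vif plugging timed out")
```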

Eventually, with enough of these, the nova-compute process would hang. The output of the Guru Meditation Report (GMR) shows 6094 threads, around 3038 of them with the traceback below:

===
------ Green Thread ------

/var/lib/openstack/lib/python3.8/site-packages/eventlet/hubs/hub.py:355 in run
    `self.fire_timers(self.clock())`

/var/lib/openstack/lib/python3.8/site-packages/eventlet/hubs/hub.py:476 in fire_timers
    `timer()`

/var/lib/openstack/lib/python3.8/site-packages/eventlet/hubs/timer.py:59 in __call__
    `cb(*args, **kw)`

/var/lib/openstack/lib/python3.8/site-packages/eventlet/hubs/__init__.py:151 in _timeout
    `current.throw(exc)`
===

In addition, 3039 of the threads output only the following:

===
------ Green Thread ------

No Traceback!
===

That makes 6077 green threads in total in that odd state. We've had a discussion about this on IRC, and it seems that it may be related to the use of `spawn_n`:

https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2022-05-05.log.html#t2022-05-05T16:20:37
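
For illustration, here is a minimal standalone sketch (not nova code) of the difference: `spawn` returns a GreenThread whose outcome the caller can collect, while `spawn_n` returns a naked greenlet, so an exception thrown into it is only printed by the hub and nothing ever joins or reaps it:

```python
# Minimal sketch (not nova code) of the spawn vs spawn_n difference
# discussed in the IRC log linked above.
import eventlet


def worker():
    eventlet.sleep(0.1)
    raise RuntimeError("boom")


# spawn() returns a GreenThread: the caller can wait() on it and the
# worker's exception is re-raised in the waiter.
gt = eventlet.spawn(worker)
try:
    gt.wait()
except RuntimeError as exc:
    print("spawn(): caught %r from the worker" % exc)

# spawn_n() returns a naked greenlet: there is nothing to wait on; the
# hub just prints the traceback to stderr and the failure is lost.
eventlet.spawn_n(worker)
eventlet.sleep(0.2)  # give the hub a chance to run the second worker
```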

Revision history for this message
Mohammed Naser (mnaser) wrote :

I have a full log of a failure here:

https://paste.opendev.org/show/bbd1luUAF8B0rwZnl4Cw/

Revision history for this message
Mohammed Naser (mnaser) wrote :

I've got another box with this:

# of "No traceback!" threads => 5149
# of "current.throw(exc)" threads => 5147

Total number of green threads: 10313. It's interesting that the number of "No traceback!" green threads almost exactly matches the number of "current.throw(exc)" ones.
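
For reference, these per-type counts can be pulled out of a saved GMR dump with something like the following; the dump file name is hypothetical:

```python
# Rough sketch for counting the two thread flavours in a saved GMR dump.
# "gmr-dump.txt" is a hypothetical file name; point it at wherever the
# report was captured.
counts = {"No Traceback!": 0, "current.throw(exc)": 0}
with open("gmr-dump.txt") as fh:
    for line in fh:
        for needle in counts:
            if needle in line:
                counts[needle] += 1
print(counts)
```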

Revision history for this message
Mohammed Naser (mnaser) wrote :

FYI, this is Wallaby.

Revision history for this message
Mohammed Naser (mnaser) wrote :

Interestingly enough, this behaviour means that no VMs spawn on this n-cpu process anymore, even though it has space:

```
[root@ctl1-rdo ~]# openstack resource provider inventory list bf0a1b28-b185-4f4b-b80a-8cf871c4b015
+----------------+------------------+----------+----------+----------+-----------+--------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  | used |
+----------------+------------------+----------+----------+----------+-----------+--------+------+
| VCPU           | 5.0              | 1        | 96       | 0        | 1         | 96     | 8    |
| MEMORY_MB      | 1.0              | 1        | 524147   | 32674    | 1         | 524147 | 8192 |
| DISK_GB        | 3.0              | 1        | 1760     | 0        | 1         | 1760   | 80   |
+----------------+------------------+----------+----------+----------+-----------+--------+------+
[root@ctl1-rdo ~]# openstack allocation candidate list --resource VCPU=4 --resource DISK_GB=64 --resource MEMORY_MB=2048 | grep bf0a1b28-b185-4f4b-b80a-8cf871c4b015
| 4 | VCPU=4,DISK_GB=64,MEMORY_MB=2048 | bf0a1b28-b185-4f4b-b80a-8cf871c4b015 | VCPU=8/480,MEMORY_MB=8192/491473,DISK_GB=80/5280 | HW_CPU_X86_SSE2,COMPUTE_DEVICE_TAGGING,HW_CPU_HYPERTHREADING,HW_CPU_X86_CLMUL,HW_CPU_X86_SSE,HW_CPU_X86_ABM,HW_CPU_X86_SSE4A,HW_CPU_X86_MMX,HW_CPU_X86_SHA,HW_CPU_X86_AVX2,HW_CPU_X86_SSE41,HW_CPU_X86_SSE42,COMPUTE_TRUSTED_CERTS,HW_CPU_X86_AESNI,HW_CPU_X86_SVM,COMPUTE_VOLUME_EXTEND,COMPUTE_NET_ATTACH_INTERFACE_WITH_TAG,HW_CPU_X86_AVX,HW_CPU_X86_BMI2,COMPUTE_NET_ATTACH_INTERFACE,COMPUTE_VOLUME_ATTACH_WITH_TAG,HW_CPU_X86_FMA3,COMPUTE_VOLUME_MULTI_ATTACH,HW_CPU_X86_SSSE3,HW_CPU_X86_F16C,HW_CPU_X86_BMI,COMPUTE_IMAGE_TYPE_ARI,COMPUTE_IMAGE_TYPE_ISO,COMPUTE_STATUS_DISABLED,COMPUTE_IMAGE_TYPE_AMI,COMPUTE_IMAGE_TYPE_AKI,COMPUTE_IMAGE_TYPE_RAW,COMPUTE_IMAGE_TYPE_QCOW2 |
```

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Note that spawn_n creates naked greenlets, while spawn wraps the greenlet in a GreenThread class, and that class adds a lot of behavior. So, yes, spawn_n can behave differently.
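
To make that concrete, here is a tiny sketch of some of the GreenThread-only machinery (completion hooks and cooperative kill), none of which exists on the raw greenlet that spawn_n returns:

```python
# Tiny sketch of GreenThread-only behavior: completion hooks via link()
# and cooperative cancellation via kill(). A naked greenlet returned by
# spawn_n() offers neither.
import eventlet


def worker():
    eventlet.sleep(10)


gt = eventlet.spawn(worker)
gt.link(lambda g: print("worker finished (or was killed)"))
eventlet.sleep(0)  # yield so the worker starts and parks in sleep()
gt.kill()          # raises GreenletExit inside the parked worker
```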

I did some tracing in a devstack (nova master, eventlet 0.33.0) but I cannot reproduce the eventlet / greenlet leak with a simple vif-plug timeout.

trace patch: https://paste.opendev.org/show/bpAHyTTFJpyynPbJjN33/
nova logs: https://paste.opendev.org/show/bQ1UBNrPSTD9xNjX4fIg/

From the logs I see that the number of greenlets increases while nova waits for the vif-plug event, but after the timeout happens the thread count goes back to the steady state it was in before the boot.

I will try this again with a wallaby devstack.

@Mohammed: what is the eventlet version on your system? Was there any failure on these computes other than the vif-plug timeout?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I tried the same thing in a wallaby devstack (eventlet 0.30.2). While I can temporarily increase the number of existing greenlets with VM boot / delete operations, after a couple of minutes of idle time the number of greenlets goes back to the steady number I saw before the boots / deletes. I've also tried the vif-plug timeout and the port-binding failure cases. Neither triggered leaked greenlets for me.

I've also pushed two patches (one for master [1] and one for wallaby [2]) that periodically log the number of greenlets in the nova-compute service. From the nova-next job runs I've collected the data [3], both from master and wallaby. The naked greenlet count seems to be ever-increasing in both runs. However, locally the same excess greenlets disappear after some idle time.

Bottom line: I'm not able to reproduce the leak locally.

[1] https://review.opendev.org/c/openstack/nova/+/841040
[2] https://review.opendev.org/c/openstack/nova/+/841041
[3] https://docs.google.com/spreadsheets/d/1nRfCK_pg-vlfv6w4PSzmKCBXC3Jv1XvxEfd2G-PtNqg/edit?usp=sharing
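
For anyone who wants to take similar samples by hand, here is a rough sketch of one way to count live greenlets by walking the garbage collector; this is an illustration, not the code from [1] or [2]:

```python
# Rough sketch of sampling live greenlets via the garbage collector.
# Not the code from the patches above; just one way to take the same
# kind of measurement by hand.
import gc

import greenlet


def count_greenlets():
    """Count greenlet objects currently tracked by the GC."""
    return sum(1 for obj in gc.get_objects()
               if isinstance(obj, greenlet.greenlet))


print("live greenlets: %d" % count_greenlets())
```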

Changed in nova:
status: New → Incomplete
Revision history for this message
Mohammed Naser (mnaser) wrote :

Is it possible that this is related?

https://review.opendev.org/c/openstack/nova/+/840993

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/842359

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "sean mooney <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/842359
Reason: this did not really seem to change much, so I'm going to abandon it for now; we can revisit later if it seems to be required.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired