OpenStack Compute (nova)

nova-compute become stuck when doing IO on busy file system

Bug #1874032 reported by Alexandre arents on 2020-04-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Fix Released	Wishlist	Alexandre arents

Bug Description

Description
===========

It seems that when nova-compute process run I/O intensive task on a busy file system,
it can become stuck and got disconnected from rabbitmq cluster.

From my understanding nova-compute do not use true OS multithreading,
but internal python multi-tasking also called cooperative multitasking,
This have some limitations like using only one core at a time,
and task did not schedule well under IO workload [1]

It look likes a "known" limitation [2]:
"Periodic tasks are important to understand because of limitations
in the threading model that OpenStack uses.
OpenStack uses cooperative threading in Python,
which means that if something long and complicated is running,
it will block other tasks inside that process from running unless it voluntarily yields execution to another cooperative thread."

I think we hit the same limitation than periodic task when we do a snapshot in qcow2/raw environment, during glance upload, if the snapshot directory is on a busy file system.

[1] https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&ved=2ahUKEwjpmujvpvfoAhUD0BoKHRkgBtIQFjAFegQICBAB&url=https%3A%2F%2Fwww.dabeaz.com%2Fpython%2FUnderstandingGIL.pdf&usg=AOvVaw33_e5qkePlFOreCu8NrE2J
[2] https://docs.openstack.org/operations-guide/ops-advanced-configuration.html

Steps to reproduce
==================
On master devstack

1)apt-get install fio

2) # create some user data in an initfile
$ cat bigfile.sh
#!/bin/sh
while true ; do cat /bin/busybox >> bigfile ; done

3)Spawn an m1.small instance with init script (we must generate snapshot file fat enought)
$openstack server create --flavor m1.small --image cirros-0.4.0-x86_64-disk --user-data bigfile.sh --nic net-id=private bigfile

4)wait few time that instance become fat enought (flavor is 20GB)
$du -sh /opt/stack/data/nova/instances/85998418-bf2c-4246-8204-c08962876d18/
18G /opt/stack/data/nova/instances/85998418-bf2c-4246-8204-c08962876d18/

5) stop instance to reach bug quicker (avoid qemu blockcopy job)
$openstack server stop 85998418-bf2c-4246-8204-c08962876d18

6) get file system busy(during snapshot) in another terminal:
$fio --ioengine libaio --iodepth 32 --direct=1 --runtime=50000 --size=100m --time_based --bs=4k --rw=randwrite --numjobs 12 --name=job

7) start snapshot:
$openstack server image create 85998418-bf2c-4246-8204-c08962876d18 --name snap1

8) you may check in top qemu-img generating snapshot file, after that nova-compute become in state D waiting for IO

9) check in log rabbit disconnection because nova-compute become stuck:
$ sudo journalctl -u devstack@n-cpu -f
Feb 27 13:42:26 alex-devstack nova-compute[31997]: DEBUG oslo_service.periodic_task [None req-e86ab239-c81d-4674-b275-9438999abfff None None] Running periodic task ComputeManager._check_instance_build_time
Feb 27 13:42:26 alex-devstack nova-compute[31997]: ERROR oslo.messaging._drivers.impl_rabbit [-] [813d5cb3-ce73-4294-8d4a-20422c2088c6] AMQP server on 54.39.183.136:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.:amqp.exceptions.ConnectionForced: Too many heartbeats missed
Feb 27 13:42:26 alex-devstack nova-compute[31997]: ERROR oslo.messaging._drivers.impl_rabbit [-] [e61dc56c-e72a-4a6b-b459-e064fa4544f7] AMQP server on 54.39.183.136:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
Feb 27 13:42:26 alex-devstack nova-compute[31997]: INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Feb 27 13:42:26 alex-devstack nova-compute[31997]: INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Feb 27 13:42:26 alex-devstack nova-compute[31997]: INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Feb 27 13:42:26 alex-devstack nova-compute[31997]: INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

NOTE:
-Issue happens only during glance upload, after snapshot file is "extracted" by a qemu-img convert, because upload IOs are made in nova-compute process using native python glance client lib.
-qemu-img convert which is done just before, is run in a subprocess(), so outside of nova-compute, and there is no issue here.
-This fio is a bit overkill (12 jobs) because I run on a ssd disk. In operation the issue may happen relatively quickly if snapshot dirextory is on:
-a platter disk, only two simultenous snapshot and nova-compute can start to hang.
-on a network drive with limited resource(ex: rbd mound) it can start with only one snapshot. (whole earth is backing up at 00 am)
-busy mean sar -d 1 10 say disk is 100% busy

SOLUTION?
==========

-Probably best solution is to use use true multithreading/workers on nova-compute to solve this ?
It seems that nova-compute is the only nova service that did not use oslo service workers:
$grep service.serve nova/cmd/*
nova/cmd/api.py: launcher.launch_service(server, workers=server.workers or 1)
nova/cmd/api_metadata.py: service.serve(server, workers=server.workers)
nova/cmd/api_os_compute.py: service.serve(server, workers=server.workers)
nova/cmd/compute.py: service.serve(server)
nova/cmd/conductor.py: service.serve(server, workers=workers)
nova/cmd/scheduler.py: service.serve(server, workers=workers)

Maybe there is reason to not using it ? Would it be a costly change(some redesign)?

-Alternatively we must assume that all I/O workload should be put out of nova-compute process(fork()).

Expected result
===============
nova-compute should stay healthy when doing IO on busy fs
(even if task run slower)

Actual result
=============
nova-compute become stuck and disconnected from rabbitmq.

Tags:

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2020-04-22:

This looks weird as we execute a separate process thru privsep for the qemu-img convert call which generates I/O writes, so it doesn't look to me a problem with service workers or greenthreading.

Marking Wishlist for the moment as it's an optimization request for I/O but I could modify it to another importance status if we see it as a bug.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2020-04-22:

The forked process in question https://github.com/openstack/nova/blob/58aaffade9f78c5fdc6b0d26ec26b70908b7a6aa/nova/privsep/qemu.py#L81

tags:	added: libvirt snapshot
Changed in nova:
importance:	Undecided → Wishlist

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2020-04-22:

Leaving the bug to New as I'm not sure we correctly triaged the bug.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2020-04-22:

OK, I have a better understanding on what we manage in Nova.

So, in order to prevent intensive I/O operations to compete, we recently added in https://review.opendev.org/#/c/609180/ a new compute semaphore that would hold competitive accesses to disk within Nova thru a configuration option.

In our case here, you can see the semaphore there : https://github.com/openstack/nova/blob/58aaffade9f78c5fdc6b0d26ec26b70908b7a6aa/nova/virt/libvirt/driver.py#L2420

As you can see in the change, there is now a config option that allows you to refrain concurrent I/O ops in Nova by setting max_concurrent_disk_ops to other value but 0.

Closing the bug as Invalid/Wishlist as I think there is a mitigation to your existing problem but feel free to discuss this problem either by filing a blueprint for describing the problem or by raising this at our next virtual PTG : https://etherpad.openstack.org/p/nova-victoria-ptg (which I hope you could attend)

Changed in nova:
status:	New → Invalid

Revision history for this message

Alexandre arents (aarents) wrote on 2020-04-27:

Thanks for the link https://review.opendev.org/#/c/609180/, It helps to avoid overload by limiting concurrent IO task running.

But it does not solve my issue, let me explain further my use case:

Imagine a compute host with 100 vms, at night customer run backup (instance snaphot).
We want to allow 5/10 simultanous snapshot even if snapshot dir is busy,
As backup is best effort, it is not an issue for us if it takes hours.

And unfortunetly I can have the issue with max_concurrent_disk_ops=1 if snap directory is on a network shared filesystem, as network drive can be busy regardless of my compute activity.

-> resource contention on the snapshot directory should just slowdown upload, not make hang nova-compute.

Currently nova-compute process own IO during upload to glance(it uses glance client)

To scale nova-compute, we need to offload nova-compute from doing those IO.
We workaround this by running snapshot upload to glance in an utils.execute(*curl) in order to get it fork().
this prevent us to have hundreds of nova-compute flapping as describe in bugs report.

As you suggested, we can discuss that during VPTG!

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-10: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/734776

Changed in nova:
assignee:	nobody → Alexandre arents (aarents)
status:	Invalid → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-07-06: Fix merged to nova (master)

Reviewed: https://review.opendev.org/734776
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=be9b7358473ca5276cff3c6491f88cd512d118b8
Submitter: Zuul
Branch: master

commit be9b7358473ca5276cff3c6491f88cd512d118b8
Author: Alexandre Arents <email address hidden>
Date: Tue Jun 9 14:17:37 2020 +0000

Snapshot: offload glance upload in a native thread

Execute glance upload in a native thread as it may block the current
coroutine until it completes.

    Despite the fact we use eventlet monkey_patching [1] to achieve cooperative
    yielding for network IO, file IO on busy file system may still get
    nova-compute hanging.

Stick those IO in a native thread using eventlet tpool.execute() [2]
avoid this issue.

[1] https://eventlet.net/doc/patching.html
[2] https://eventlet.net/doc/threading.html

Closes-Bug: #1874032
Change-Id: I8dbc579e0037969aab4f2bb500fccfbde4190726