nova-compute becomes stuck when doing I/O on a busy file system
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Wishlist | Alexandre arents |
Bug Description
Description
===========
It seems that when the nova-compute process runs an I/O-intensive task on a busy file system,
it can become stuck and get disconnected from the RabbitMQ cluster.
From my understanding, nova-compute does not use true OS multithreading
but Python's internal multitasking, also called cooperative multitasking.
This has some limitations, such as using only one core at a time,
and tasks are not scheduled well under an I/O workload [1].
It looks like a "known" limitation [2]:
"Periodic tasks are important to understand because of limitations
in the threading model that OpenStack uses.
OpenStack uses cooperative threading in Python,
which means that if something long and complicated is running,
it will block other tasks inside that process from running unless it voluntarily yields execution to another cooperative thread."
I think we hit the same limitation as with periodic tasks when we take a snapshot in a qcow2/raw environment: it happens during the glance upload, if the snapshot directory is on a busy file system.
[1] https:/
[2] https:/
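To illustrate the limitation quoted above, here is a minimal sketch of a blocking call starving a cooperative scheduler. It is only an analogy: nova-compute uses eventlet green threads, while this sketch uses asyncio from the standard library so that it runs without extra packages. The names heartbeat and blocking_upload are hypothetical, and time.sleep() stands in for a read() that blocks on a busy file system.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.05):
    """Hypothetical periodic task, e.g. a messaging heartbeat."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)  # yields to other tasks


async def blocking_upload(seconds):
    """Hypothetical upload step that blocks and never yields."""
    # time.sleep() stands in for a blocking read() on a busy file system.
    # It stops the whole event loop, not only this task.
    time.sleep(seconds)


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.2)        # the heartbeat runs normally here
    before = len(ticks)
    await blocking_upload(0.5)      # the heartbeat cannot run during this call
    during = len(ticks) - before
    hb.cancel()
    return before, during


if __name__ == "__main__":
    before, during = asyncio.run(main())
    print(f"ticks before blocking call: {before}, during: {during}")
```

Run as a script, it reports several ticks before the blocking call and none during it, which mirrors the missed RabbitMQ heartbeats.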
Steps to reproduce
==================
On master devstack
1) apt-get install fio
2) Create some user data in an init file:
$ cat bigfile.sh
#!/bin/sh
while true ; do cat /bin/busybox >> bigfile ; done
3) Spawn an m1.small instance with the init script (the snapshot file must be large enough)
$ openstack server create --flavor m1.small --image cirros-
4) Wait until the instance disk has grown large enough (the flavor is 20 GB)
$ du -sh /opt/stack/
18G /opt/stack/
5) Stop the instance to reach the bug quicker (this avoids the qemu blockcopy job)
$ openstack server stop 85998418-
6) Keep the file system busy (during the snapshot) from another terminal:
$ fio --ioengine libaio --iodepth 32 --direct=1 --runtime=50000 --size=100m --time_based --bs=4k --rw=randwrite --numjobs 12 --name=job
7) Start the snapshot:
$ openstack server image create 85998418-
8) In top you can watch qemu-img generating the snapshot file; after that, nova-compute goes into state D (uninterruptible sleep), waiting for I/O
9) Check the log for the RabbitMQ disconnection caused by nova-compute being stuck:
$ sudo journalctl -u devstack@n-cpu -f
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
Feb 27 13:42:26 alex-devstack nova-compute[
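The process state from step 8 can also be read without top. The sketch below assumes Linux, where /proc/<pid>/stat holds the one-letter state right after the command name in parentheses. The helper names and the sample line in the usage note are hypothetical.

```python
def parse_proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take everything after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def read_proc_state(pid):
    """Read the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

For example, parse_proc_state("1234 (nova-compute) D 1 1234 1234 0 -1") returns "D", the uninterruptible sleep state seen while nova-compute waits for I/O.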
NOTE:
- The issue happens only during the glance upload, after the snapshot file has been "extracted" by qemu-img convert, because the upload I/O is done inside the nova-compute process by the native Python glance client library.
- The qemu-img convert that runs just before is executed in a subprocess, so outside of nova-compute, and there is no issue there.
- This fio run is a bit overkill (12 jobs) because I run on an SSD. In production the issue may happen relatively quickly if the snapshot directory is on:
  - a platter disk: only two simultaneous snapshots and nova-compute can start to hang.
  - a network drive with limited resources (e.g. an rbd mount): it can start with only one snapshot (everybody runs backups at midnight).
- "Busy" means that sar -d 1 10 reports the disk as 100% busy.
SOLUTION?
=========
- Probably the best solution is to use true multithreading/
It seems that nova-compute is the only nova service that does not use oslo service workers:
$ grep service.serve nova/cmd/*
nova/cmd/api.py: launcher.
nova/cmd/
nova/cmd/
nova/cmd/
nova/cmd/
nova/cmd/
Maybe there is a reason not to use it? Would it be a costly change (some redesign)?
- Alternatively, we should assume that every I/O workload must be moved out of the nova-compute process (fork()).
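As a rough sketch of the first idea, the blocking call can be handed to a native OS thread so that the cooperative scheduler keeps running. Again this uses asyncio as a stand-in for eventlet (eventlet has a comparable facility in its tpool module). The blocking_read and heartbeat names are hypothetical, and this is not necessarily how nova would implement it.

```python
import asyncio
import time


def blocking_read(seconds):
    """Hypothetical blocking file read done during the image upload."""
    time.sleep(seconds)  # releases the GIL, as real blocking file I/O does
    return "done"


async def heartbeat(ticks, interval=0.05):
    """Hypothetical periodic task that must keep running."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    loop = asyncio.get_running_loop()
    # Run the blocking call in a worker OS thread (default executor).
    # The event loop stays free to schedule the heartbeat meanwhile.
    result = await loop.run_in_executor(None, blocking_read, 0.5)
    hb.cancel()
    return result, len(ticks)


if __name__ == "__main__":
    result, tick_count = asyncio.run(main())
    print(f"upload result: {result}, heartbeat ticks meanwhile: {tick_count}")
```

Here the heartbeat keeps ticking while the read is in progress. The trade-off is that the offloaded code runs in a real thread, so it must not touch state that is only safe under cooperative scheduling.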
Expected result
===============
nova-compute should stay healthy when doing I/O on a busy file system
(even if the task runs slower).
Actual result
=============
nova-compute becomes stuck and gets disconnected from RabbitMQ.
This looks weird, as we execute a separate process through privsep for the qemu-img convert call that generates the I/O writes, so it doesn't look to me like a problem with service workers or greenthreading.
Marking as Wishlist for the moment, as it's an optimization request for I/O, but I could change it to another importance if we see it as a bug.