n-cpu.service consuming 100% of CPU indeterminately
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Won't Fix | Undecided | Unassigned |
Bug Description
Description
==============
I used fault injection to assess the robustness of nova-conductor, and by injecting a specific sequence of faults I triggered a failure that threatens the robustness of the system. The result of applying these faults at the nova-conductor interface is that nova-compute can no longer provision new instances.
Steps to reproduce
==================
I reproduced this bug in 10 out of 10 attempts, using devstack/queens.
The workload consists of the following steps:
1) First, create a VM with the following flavor: 64MB RAM, 1 VCPU, 0 DISK; use the reference image 'cirros.0.3.4'; all other settings can be left at the admin account defaults;
2) Rebuild with an alternative image: for instance, 'cirros 0.4.0';
3) Rebuild with the reference image again;
4) Shelve the instance;
5) Delete the instance;
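The five workload steps above can be sketched as a small script using the standard `openstack` CLI. This is a sketch, not the exact commands from the report: the flavor name `tiny-64` and server name `repro-vm` are placeholders of mine, and by default the script only echoes each command (set `RUN=openstack` to execute it against a real devstack/queens cloud).

```shell
# Dry-run sketch of the reproduction workload; flavor and server
# names are placeholders, not taken from the bug report.
run_workload() {
    RUN="${RUN:-echo openstack}"    # default: just print each command

    # step 1: tiny flavor (64MB RAM, 1 VCPU, 0 disk), boot from cirros 0.3.4
    $RUN flavor create --ram 64 --vcpus 1 --disk 0 tiny-64
    $RUN server create --flavor tiny-64 --image cirros-0.3.4 repro-vm

    # steps 2-3: rebuild with an alternative image, then the reference image
    $RUN server rebuild --image cirros-0.4.0 repro-vm
    $RUN server rebuild --image cirros-0.3.4 repro-vm

    # steps 4-5: shelve, then delete the instance
    $RUN server shelve repro-vm
    $RUN server delete repro-vm
}

run_workload
```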
Below, I describe the faultload. Each time a fault is injected, the workload is executed from the beginning. The steps are:
1) Intercept the first RPC message (i.e. AMQP) that calls for 'schedule_
2) Inject the 'fault' in 'schedule_
The pseudo-algorithm:
1. execute workload
2. for each fault in ['2', '-1000000000000
2.1. execute workload in parallel with faultload(fault)
3. observe the CPU activity of the n-cpu.service process in devstack
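The pseudo-algorithm above can be sketched as a small driver loop. This is a hypothetical harness, not the reporter's tool: `execute_workload` and `inject_fault` are my own stand-ins for the real devstack workload and the AMQP interceptor, and the fault list only includes the first value, '2', since the second entry is truncated in the report.

```python
import threading

# Only the first fault value is legible in the report; the second
# entry of the fault list is truncated there, so it is omitted here.
FAULTS = ["2"]

def execute_workload(log):
    # Stand-in for: create -> rebuild -> rebuild again -> shelve -> delete.
    log.append("workload-done")

def inject_fault(fault, log):
    # Stand-in for: intercept the first RPC calling the scheduler
    # and replace a field with the fault value.
    log.append("fault:" + fault)

def run_campaign():
    log = []
    # 1. execute the workload once without faults (baseline)
    execute_workload(log)
    # 2. one run per fault, workload and injection in parallel
    for fault in FAULTS:
        w = threading.Thread(target=execute_workload, args=(log,))
        f = threading.Thread(target=inject_fault, args=(fault, log))
        w.start(); f.start()
        w.join(); f.join()
    # 3. afterwards, inspect CPU usage of n-cpu.service (e.g. with `top`)
    return log

print(run_campaign())
```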
Expected result
==================
nova-compute handles the faults without impacting future requests.
Actual result
================
nova-compute consumes 100% of CPU, and new instances are set to the 'error' state without any clue about the issue, so it is not possible to create new instances without restarting n-cpu.service.
Environment
==============
Devstack/Queens on a single machine with default settings.
Logs & Configs
=================
Logs attached.
description: updated
tags: added: compute
tags: added: fault-injection
I'm going to say the same thing as bug 1801733 - this is super nifty and interesting, but realistically is not a concern and will most likely never get addressed.