[SWARM][9.2] OSTF smoke fails in bdd_ha_one_controller_compact due to high IO wait
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Won't Fix | High | Dmitry Mescheryakov |
Mitaka | Confirmed | High | Dmitry Mescheryakov |
Bug Description
Test bdd_ha_one_controller_compact:
1. Create a cluster with Neutron VLAN segmentation
2. Add 1 node with the controller role
3. Add 1 node with the compute and cinder-block-device roles
4. Deploy the cluster
5. Network check
6. Run OSTF tests
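For reference, the flow above corresponds to a fuel-qa-style system test roughly like the following minimal sketch. The `env` helper and all of its method names are hypothetical stand-ins for the fuel-qa test primitives, not the real fuelweb_test API:

```python
# Hypothetical sketch of the reproduction flow; `env` and its methods are
# illustrative stand-ins, not the actual fuel-qa API.

def reproduce(env):
    # Steps 1-3: Neutron VLAN cluster, one controller,
    # one compute + cinder-block-device node
    cluster = env.create_cluster(net_provider="neutron",
                                 net_segment_type="vlan")
    env.add_node(cluster, roles=["controller"])
    env.add_node(cluster, roles=["compute", "cinder-block-device"])
    # Step 4: deploy
    env.deploy_cluster(cluster)
    # Step 5: network verification
    env.verify_network(cluster)
    # Step 6: run OSTF; the smoke suite is what failed in this report
    env.run_ostf(cluster, test_sets=["smoke"])
```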
The OSTF run (step 6) failed with the following traceback:
AssertionError: Failed 3 OSTF tests; should fail 0 tests. Names of failed tests:
- Check network connectivity from instance via floating IP (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
- Launch instance, create snapshot, launch instance from snapshot (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
- Create user and authenticate with it. (failure) Can not authenticate in Horizon. Please refer to OpenStack logs for more details.
Some errors from logs:
nova-compute.log
Instance failed network setup after 1 attempt(s)
ERROR nova.compute.
PortBindingFailed: Binding failed for port 3831a319-
neutron-server.log
ERROR neutron.
ERROR neutron.
RouterReschedul
ovs-vsctl.log
ovs|00001|
ovs|00001|
ovs|00001|
horizon_error.log
Timeout when reading response headers from daemon process 'horizon': /usr/share/
Atop logs show that at that time (starting from 22:38:04) the system CPU spent a lot of time in the iowait state (>100% on a 2-CPU machine). It seems that this was the root cause of the issue.
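For context, atop sums CPU percentages across cores, so on a 2-CPU machine iowait can reach 200%; that is why a figure above 100% is possible. If the issue reoccurs, the same number can be sampled without atop from the iowait counter in /proc/stat. A minimal, Linux-only sketch:

```python
# Sample aggregate CPU times from /proc/stat twice and report iowait the way
# atop does: as a percentage summed over all CPUs (so >100% is possible).
import os
import time

def cpu_times():
    # First line: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    return sum(values), values[4]  # (total jiffies, iowait jiffies)

ncpus = os.cpu_count() or 1
total_a, iowait_a = cpu_times()
time.sleep(5)
total_b, iowait_b = cpu_times()

iowait_pct = 100.0 * ncpus * (iowait_b - iowait_a) / (total_b - total_a)
print("iowait over the last 5s: %.1f%% of a possible %d%%"
      % (iowait_pct, 100 * ncpus))
```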
Expected result: OSTF passed.
Actual result: OSTF failed.
Reproducibility: once
Version: 9.2 snapshot #646
Snapshot: https:/
Changed in fuel:
assignee: nobody → Dmitry Mescheryakov (dmitrymex)

Changed in fuel:
milestone: none → 8.0-updates
milestone: 8.0-updates → 9.2
status: New → Confirmed
importance: Undecided → High

Changed in fuel:
status: Confirmed → Won't Fix
The issue happened because the system was heavily overloaded - check the atop logs from node-1. There it can be seen that from 22:38:04 until the end of the test, iowait rose above 100%, from time to time reaching above 150% (on a 2-CPU machine). Atop also shows that some neutron-server processes were periodically in the D (uninterruptible sleep, waiting for IO) state during that time.
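If the problem reoccurs, such processes can also be caught without atop by scanning /proc for the D state. A minimal sketch:

```python
# List processes currently in D (uninterruptible sleep, i.e. blocked on IO)
# state by scanning /proc/<pid>/stat: field 3 is the state character, and
# field 2 is the parenthesised command name.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/stat" % pid) as f:
            data = f.read()
    except OSError:
        continue  # process exited between listdir() and open()
    # comm may itself contain spaces, so split around the last ")"
    comm = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 2]
    if state == "D":
        print(pid, comm)
```

Sampling this in a loop during the test window and repeatedly seeing neutron-server would confirm the same picture atop recorded.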
An example from l3-agent.log and neutron-server.log on node-1: request b34d3ca23d2a47ce8af2c8836268a024 was sent from the l3 agent at 22:39:04. It was received by the server at 22:40:19, and the server sent its reply at 22:42:12. The request-reply cycle took three minutes where it should have taken 3 seconds.
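Such stalls can be quantified across the whole log pair by joining the two files on the message ID and diffing the timestamps. A minimal sketch, assuming oslo-style lines that start with a timestamp and carry a 32-hex message ID inline (the regexes would need adjusting to the actual log format):

```python
# Join two log files on 32-hex message ids and report how long each id took
# to travel from the first file (sender) to the second (receiver).
# Assumption: lines start with "YYYY-MM-DD HH:MM:SS" and carry the id inline.
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
MSG_ID = re.compile(r"\b([0-9a-f]{32})\b")

def first_seen(path):
    seen = {}  # message id -> first timestamp it appears at
    with open(path) as log:
        for line in log:
            stamp, msg = STAMP.match(line), MSG_ID.search(line)
            if stamp and msg and msg.group(1) not in seen:
                seen[msg.group(1)] = datetime.strptime(
                    stamp.group(1), "%Y-%m-%d %H:%M:%S")
    return seen

sent = first_seen("l3-agent.log")
received = first_seen("neutron-server.log")
for msg_id, t_sent in sorted(sent.items(), key=lambda item: item[1]):
    if msg_id in received:
        lag = (received[msg_id] - t_sent).total_seconds()
        if lag > 10:  # a healthy RPC round trip is a few seconds at most
            print("%s reached the server %.0fs after it was sent"
                  % (msg_id, lag))
```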
It is not clear what exactly consumed so much IO. Possible options:
* active swapping (1 GB of swap was in use at that time)
* a bad or slow disk
* noisy neighbors (the node was a VM on a Jenkins slave)
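Of these, active swapping is the easiest to confirm or rule out on the next occurrence: allocated swap by itself is harmless, it is ongoing swap traffic that generates IO. A minimal sketch that samples the pswpin/pswpout counters from /proc/vmstat during the suspect window:

```python
# Distinguish "1 GB of swap merely allocated" from "actively swapping":
# pswpin/pswpout in /proc/vmstat count pages moved from/to swap, so a
# growing delta during the test window means real swap IO.
import time

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

before = swap_counters()
time.sleep(10)
after = swap_counters()
for key in ("pswpin", "pswpout"):
    print("%s: %d pages in the last 10s" % (key, after[key] - before[key]))
```

If those deltas stay near zero while iowait is high, the IO is coming from somewhere else, and iostat -x output plus the hypervisor side of the Jenkins slave would be the next places to look.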
QA team, let's see if the IO wait issue reoccurs. If it does, we'll see what we can do about it.