[SWARM][9.2] OSTF smoke fails in bdd_ha_one_controller_compact due to high IO wait
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Won't Fix | High | Dmitry Mescheryakov |
Mitaka | Confirmed | High | Dmitry Mescheryakov |
Bug Description
Test bdd_ha_one_controller_compact:
1. Create a cluster with Neutron VLAN segmentation
2. Add 1 node with the controller role
3. Add 1 node with the compute and cinder-block-device roles
4. Deploy the cluster
5. Network check
6. Run OSTF tests
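For reference, the flow above corresponds to a fuel-qa-style system test roughly like the following minimal sketch. The `env` helper and all of its method names are hypothetical stand-ins for the fuel-qa test primitives, not the real fuelweb_test API:

```python
# Hypothetical sketch of the reproduction flow; `env` and its methods are
# illustrative stand-ins, not the actual fuel-qa API.

def reproduce(env):
    # Steps 1-3: Neutron VLAN cluster, one controller,
    # one compute + cinder-block-device node
    cluster = env.create_cluster(net_provider="neutron",
                                 net_segment_type="vlan")
    env.add_node(cluster, roles=["controller"])
    env.add_node(cluster, roles=["compute", "cinder-block-device"])
    # Step 4: deploy
    env.deploy_cluster(cluster)
    # Step 5: network verification
    env.verify_network(cluster)
    # Step 6: run OSTF; the smoke suite is what failed in this report
    env.run_ostf(cluster, test_sets=["smoke"])
```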
The OSTF run (step 6) failed with the following traceback:
AssertionError: Failed 3 OSTF tests; should fail 0 tests. Names of failed tests:
- Check network connectivity from instance via floating IP (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
- Launch instance, create snapshot, launch instance from snapshot (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
- Create user and authenticate with it. (failure) Can not authenticate in Horizon. Please refer to OpenStack logs for more details.
Some errors from logs:
nova-compute.log
Instance failed network setup after 1 attempt(s)
ERROR nova.compute.
PortBindingFailed: Binding failed for port 3831a319-
neutron-server.log
ERROR neutron.
ERROR neutron.
RouterReschedul
ovs-vsctl.log
ovs|00001|
ovs|00001|
ovs|00001|
horizon_error.log
Timeout when reading response headers from daemon process 'horizon': /usr/share/
Atop logs show that at that time (starting from 22:38:04) the system CPU spent a lot of time in the iowait state (>100% on a 2-CPU machine). It seems that this was the root cause of the issue.
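For context, atop sums CPU percentages across cores, so on a 2-CPU machine iowait can reach 200%; that is why a figure above 100% is possible. If the issue reoccurs, the same number can be sampled without atop from the iowait counter in /proc/stat. A minimal, Linux-only sketch:

```python
# Sample aggregate CPU times from /proc/stat twice and report iowait the way
# atop does: as a percentage summed over all CPUs (so >100% is possible).
import os
import time

def cpu_times():
    # First line: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    return sum(values), values[4]  # (total jiffies, iowait jiffies)

ncpus = os.cpu_count() or 1
total_a, iowait_a = cpu_times()
time.sleep(5)
total_b, iowait_b = cpu_times()

iowait_pct = 100.0 * ncpus * (iowait_b - iowait_a) / (total_b - total_a)
print("iowait over the last 5s: %.1f%% of a possible %d%%"
      % (iowait_pct, 100 * ncpus))
```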
Expected result: OSTF passed.
Actual result: OSTF failed.
Reproducibility: once
Version: 9.2 snapshot #646
Snapshot: https:/
Changed in fuel:
assignee: nobody → Dmitry Mescheryakov (dmitrymex)

Changed in fuel:
milestone: none → 8.0-updates
milestone: 8.0-updates → 9.2
status: New → Confirmed
importance: Undecided → High

Changed in fuel:
status: Confirmed → Won't Fix
The issue happened because the system was heavily overloaded - check the atop logs from node-1. There it can be seen that from 22:38:04 until the end of the test, iowait rose above 100%, from time to time reaching above 150% (on a 2-CPU machine). Atop also shows that some neutron-server processes were periodically in the D (uninterruptible sleep, waiting for IO) state during that time.
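If the problem reoccurs, such processes can also be caught without atop by scanning /proc for the D state. A minimal sketch:

```python
# List processes currently in D (uninterruptible sleep, i.e. blocked on IO)
# state by scanning /proc/<pid>/stat: field 3 is the state character, and
# field 2 is the parenthesised command name.
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/stat" % pid) as f:
            data = f.read()
    except OSError:
        continue  # process exited between listdir() and open()
    # comm may itself contain spaces, so split around the last ")"
    comm = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 2]
    if state == "D":
        print(pid, comm)
```

Sampling this in a loop during the test window and repeatedly seeing neutron-server would confirm the same picture atop recorded.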
An example from l3-agent.log and neutron-server.log on node-1: request b34d3ca23d2a47ce8af2c8836268a024 was sent from the l3 agent at 22:39:04. It was received by the server at 22:40:19, and the server sent its reply at 22:42:12. The request-reply cycle took three minutes where it should have taken 3 seconds.
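Such stalls can be quantified across the whole log pair by joining the two files on the message ID and diffing the timestamps. A minimal sketch, assuming oslo-style lines that start with a timestamp and carry a 32-hex message ID inline (the regexes would need adjusting to the actual log format):

```python
# Join two log files on 32-hex message ids and report how long each id took
# to travel from the first file (sender) to the second (receiver).
# Assumption: lines start with "YYYY-MM-DD HH:MM:SS" and carry the id inline.
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
MSG_ID = re.compile(r"\b([0-9a-f]{32})\b")

def first_seen(path):
    seen = {}  # message id -> first timestamp it appears at
    with open(path) as log:
        for line in log:
            stamp, msg = STAMP.match(line), MSG_ID.search(line)
            if stamp and msg and msg.group(1) not in seen:
                seen[msg.group(1)] = datetime.strptime(
                    stamp.group(1), "%Y-%m-%d %H:%M:%S")
    return seen

sent = first_seen("l3-agent.log")
received = first_seen("neutron-server.log")
for msg_id, t_sent in sorted(sent.items(), key=lambda item: item[1]):
    if msg_id in received:
        lag = (received[msg_id] - t_sent).total_seconds()
        if lag > 10:  # a healthy RPC round trip is a few seconds at most
            print("%s reached the server %.0fs after it was sent"
                  % (msg_id, lag))
```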
It is not clear what exactly consumed so much IO. Possible options:
* active swapping (1 GB of swap was in use at that time)
* a bad or slow disk
* noisy neighbors (the node was a VM on a Jenkins slave)
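Of these, active swapping is the easiest to confirm or rule out on the next occurrence: allocated swap by itself is harmless, it is ongoing swap traffic that generates IO. A minimal sketch that samples the pswpin/pswpout counters from /proc/vmstat during the suspect window:

```python
# Distinguish "1 GB of swap merely allocated" from "actively swapping":
# pswpin/pswpout in /proc/vmstat count pages moved from/to swap, so a
# growing delta during the test window means real swap IO.
import time

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

before = swap_counters()
time.sleep(10)
after = swap_counters()
for key in ("pswpin", "pswpout"):
    print("%s: %d pages in the last 10s" % (key, after[key] - before[key]))
```

If those deltas stay near zero while iowait is high, the IO is coming from somewhere else, and iostat -x output plus the hypervisor side of the Jenkins slave would be the next places to look.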
QA team, let's see if the IO wait issue reoccurs. If it does, we'll see what we can do about it.