[SWARM][9.2] OSTF smoke fails in bdd_ha_one_controller_compact due to high IO wait

Bug #1651083 reported by Vladimir Jigulin
Affects                      Status      Importance   Assigned to           Milestone
Fuel for OpenStack           Won't Fix   High         Dmitry Mescheryakov
Fuel for OpenStack (Mitaka)  Confirmed   High         Dmitry Mescheryakov

Bug Description

Test bdd_ha_one_controller_compact with steps:
1. Create cluster with Neutron vlan
2. Add 1 node with controller role
3. Add 1 node with compute and cinder-block-device roles
4. Deploy the cluster
5. Network check
6. Run OSTF tests

failed with traceback:

AssertionError: Failed 3 OSTF tests; should fail 0 tests. Names of failed tests:
  - Check network connectivity from instance via floating IP (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
  - Launch instance, create snapshot, launch instance from snapshot (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.
  - Create user and authenticate with it. (failure) Can not authenticate in Horizon. Please refer to OpenStack logs for more details.

Some errors from logs:
nova-compute.log
Instance failed network setup after 1 attempt(s)
ERROR nova.compute.manager Traceback (most recent call last):
PortBindingFailed: Binding failed for port 3831a319-a050-45ca-8236-0f921f1b1997, please check n

neutron-server.log
ERROR neutron.plugins.ml2.managers [req-05e94f4a-de9d-4aab-ba1d-ae0966b7c9b5 1670148561324ae08ff49396830925d4 58b16f1423bb4c16ba04e24c5ff16042 - - -] Failed to bind port 3831a319-a050-45ca-8236-0f921f1b1997 on host node-2.test.domain.local for vnic_type normal using segments [{'segmentation_id': 1018L, 'physical_network': u'physnet2', 'id': u'ed8b6f46-e3bf-4be6-87c4-61f41e16ec33', 'network_type': u'vlan'}]
ERROR neutron.db.l3_agentschedulers_db [req-8ecb6963-7096-4d5d-b1f3-9f0d0c63d5a1 - - - - -] Failed to reschedule router 09e536c3-88ac-4eb1-b553-6473d6f8c2d0
RouterReschedulingFailed: Failed rescheduling router 09e536c3-88ac-4eb1-b553-6473d6f8c2d0: no eligible l3 agent found.
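
The RouterReschedulingFailed error means that at rescheduling time neutron considered no L3 agent alive; under heavy iowait an agent can easily miss its report interval and be marked dead. A minimal liveness-check sketch, assuming the Mitaka-era neutron CLI with JSON output and admin credentials in the environment (the exact field names and "alive" encoding are assumptions):

# Hypothetical check: list L3 agents that neutron currently considers dead.
import json
import subprocess

def dead_l3_agents():
    out = subprocess.check_output(
        ["neutron", "agent-list", "-f", "json"], universal_newlines=True)
    agents = json.loads(out)
    # Assumed field names; the "alive" column may be a bool or a smiley.
    return [a for a in agents
            if a.get("agent_type") == "L3 agent"
            and a.get("alive") not in (True, ":-)")]

if __name__ == "__main__":
    for agent in dead_l3_agents():
        print("dead L3 agent on host %s" % agent.get("host"))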

ovs-vsctl.log
ovs|00001|db_ctl_base|ERR|no row "int-br-prv" in table Interface
ovs|00001|db_ctl_base|ERR|no row "int-br-prv" in table Port
ovs|00001|db_ctl_base|ERR|no row "phy-br-prv" in table Port

horizon_error.log
Timeout when reading response headers from daemon process 'horizon': /usr/share/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi

Atop logs show that at that time (starting from 22:38:04) the CPUs spent a large share of their time in the iowait state (> 100% on a 2-CPU machine). It seems that this was the root cause of the issue.
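
For reference, iowait can be cross-checked without atop by sampling /proc/stat. A minimal sketch; note that atop sums percentages per CPU, so the 100%+ figures above correspond to roughly 50%+ of total CPU time in this sampler's terms:

# Rough iowait sampler over /proc/stat.
# Aggregate "cpu" line field order: user nice system idle iowait irq
# softirq steal guest guest_nice (times in USER_HZ ticks).
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def iowait_percent(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

if __name__ == "__main__":
    print("iowait: %.1f%%" % iowait_percent())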

Expected results: OSTF passed.
Actual result: OSTF failed.
Reproducibility: once
Version: 9.2 snapshot #646
Snapshot: https://drive.google.com/file/d/0Bw7ZahkM7_sJQVVwV3lmTFl6ZmM

Changed in fuel:
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
Changed in fuel:
milestone: none → 8.0-updates
milestone: 8.0-updates → 9.2
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue happened because the system was heavily overloaded; see the atop logs from node-1. There it can be seen that from 22:38:04 until the end of the test iowait stayed above 100%, from time to time rising above 150% (on a 2-CPU machine). Atop also shows that during that time some neutron-server processes were periodically in the D (waiting for IO) state.
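
Such D-state processes can also be listed directly from /proc; a minimal sketch:

# List processes stuck in the D (uninterruptible sleep, usually disk IO)
# state, similar to what atop showed for neutron-server.
import os

def d_state_processes():
    procs = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                data = f.read()
        except IOError:  # process exited while we were scanning
            continue
        # comm is in parentheses and may contain spaces; state follows it.
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2]
        if state == "D":
            procs.append((pid, comm))
    return procs

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print("%s %s" % (pid, comm))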

An example from l3-agent.log and neutron-server.log on node-1:
Request b34d3ca23d2a47ce8af2c8836268a024 was sent from the l3 agent at 22:39:04. It was received by the server at 22:40:19, and the server sent a reply at 22:42:12. The request-reply round trip took three minutes where it should have taken around three seconds.
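
That delay can be measured mechanically by correlating the request ID across the two logs. A hypothetical helper, assuming local copies of the files and the default oslo.log timestamp format ("2016-12-19 22:39:04.123") at the start of each line:

# Measure how long a request was in flight between two log files.
from datetime import datetime

def first_timestamp(path, request_id):
    with open(path) as f:
        for line in f:
            if request_id in line:
                return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
    return None

req = "b34d3ca23d2a47ce8af2c8836268a024"
sent = first_timestamp("l3-agent.log", req)
received = first_timestamp("neutron-server.log", req)
if sent and received:
    print("request sat in the queue for %s" % (received - sent))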

It is not clear what exactly consumed so much IO. Possible options (a quick check for the first one is sketched after this list):
 * active swapping (1 GB was in use at that time)
 * a bad or slow disk
 * noisy neighbors (the node was a VM on a Jenkins slave)
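
A quick check for the first option, reading swap usage from /proc/meminfo the same way free(1) does:

# Report swap currently in use, in megabytes.
def swap_used_mb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are in kB
    return (info["SwapTotal"] - info["SwapFree"]) / 1024.0

if __name__ == "__main__":
    print("swap in use: %.0f MB" % swap_used_mb())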

QA team, let's see if the IO wait issue reoccurs. If it does, we will see what we can do about it.

Changed in fuel:
status: Confirmed → Incomplete
summary: - [SWARM][9.2] OSTF smoke fails in bdd_ha_one_controller_compact
+ [SWARM][9.2] OSTF smoke fails in bdd_ha_one_controller_compact due to
+ high IO wait
description: updated
Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Fuel QA Team (fuel-qa)
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima, the new reproduction was on an env with 4 GB of RAM; could you check the failure one more time and identify the cause of the issue?

Changed in fuel:
status: Incomplete → Confirmed
assignee: Fuel QA Team (fuel-qa) → Dmitry Mescheryakov (dmitrymex)
Roman Vyalov (r0mikiam)
Changed in fuel:
status: Confirmed → Won't Fix