Tempest test test_schedule_to_all_nodes fails

Bug #1606546 reported by Sofiia Andriichenko
Affects               Status      Importance   Assigned to   Milestone
Mirantis OpenStack    Confirmed   Medium       Unassigned
9.x                   Confirmed   Medium       Unassigned

Bug Description

Detailed bug description:
9.1 snapshot #51

Configuration:
Settings:
Compute - QEMU.
Network - Neutron with VLAN segmentation.
Storage Backends - LVM
Additional services - Install Ironic, Install Sahara

In the Settings -> Compute tab, enable Nova quotas
In the Settings -> OpenStack Services tab, enable Install Ceilometer and Aodh
In the Networks -> Other tab, enable Neutron DVR
In the Settings -> Security tab, enable TLS for OpenStack public endpoints and HTTPS for Horizon

Nodes: controller, compute, ironic, cinder, Telemetry - MongoDB

Steps to reproduce:
    1. Deploy the ISO with the configuration above (see Detailed bug description)
    2. Navigate to a controller node
    3. Install git (apt-get install git)
    4. Clone the script that deploys rally + tempest
       # git clone https://github.com/obutenko/mos-rally-verify.git
    5. Follow the instructions at https://github.com/obutenko/mos-rally-verify
    6. Execute the necessary steps to deploy tempest
    7. Run the test in debug mode (a condensed command sketch follows these steps)
        #rally --debug verify start --regex tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes
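
For reference, a condensed shell version of steps 3-7 on the controller; only the commands already listed above are assumed, and the exact rally/tempest deployment steps come from the repo README:
    # apt-get install git
    # git clone https://github.com/obutenko/mos-rally-verify.git
    # cd mos-rally-verify    (deploy rally + tempest as described in the repo README)
    # rally --debug verify start --regex tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes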

Expected results:
Test passes

Actual result:
Test fails
(see comments)

Reproducibility:
See attachment

Workaround:
---

Impact:
---

Description of the environment:
See (Detailed bug description)

Additional information:
Trace:
Traceback (most recent call last):
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/test.py", line 106, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/scenario/test_server_multinode.py", line 71, in test_schedule_to_all_nodes
    wait_until='ACTIVE')
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/scenario/manager.py", line 237, in create_server
    image_id=image_id, **kwargs)
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/common/compute.py", line 167, in create_test_server
    % server['id'])
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/common/compute.py", line 149, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "/home/rally/.rally/tempest/for-deployment-cad2ad9f-b0e9-4ac8-ae7c-a8d4da4b061c/tempest/common/waiters.py", line 77, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server ada719f3-b9a7-43e3-9c13-d61e3edc60e5 failed to build and is in ERROR status
Details: {u'message': u'OVS configuration failed with: Unexpected error while running command.\nCommand: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo780dfdee-92 -- add-port br-int qvo780dfdee-92 -- set Interface qvo780dfdee-92 ext', u'code': 500, u'details': u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\n filter_properties)\n File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2116, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'created': u'2016-07-26T02:12:37Z'}

snapshot https://drive.google.com/open?id=0B2db-pBC_yblM3UwY3hOOGtEQWM

Tags: tempest
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :
Changed in mos:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → MOS Nova (mos-nova)
description: updated
Changed in mos:
assignee: MOS Nova (mos-nova) → Sergey Nikitin (snikitin)
Revision history for this message
Sergey Nikitin (snikitin) wrote :

Are you sure that you attached the right snapshot? I can't even find the UUID of the failed server there.

Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Sergey, I'm sure.

Revision history for this message
Sergey Nikitin (snikitin) wrote :

I can't reproduce it manually. Also, I still think the snapshot was created from some other environment, because:

1. According to the snapshot logs, 'test_schedule_to_all_nodes' passed:

$ grep -irn "test_schedule_to_all_nodes"
nailgun.test.domain.local/root/log.log:1888:{1} tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes [64.947272s] ... ok
node-5/root/log.log:1888:{1} tempest.scenario.test_server_multinode.TestServerMultinode.test_schedule_to_all_nodes [64.947272s] ... ok

2. I can't find the error message or the server UUID in the logs:

nik@snikitin: fuel-snapshot-2016-07-26_09-04-51$ grep -irn "OVS configuration failed"
nik@snikitin: fuel-snapshot-2016-07-26_09-04-51$

nik@snikitin: fuel-snapshot-2016-07-26_09-04-51$ grep -irn "ada719f3-b9a7-43e3-9c13-d61e3edc60e5"
nik@snikitin:fuel-snapshot-2016-07-26_09-04-51$

3. According to the logs, the environment from the snapshot has the following nodes:

'slave-01_controller_mongo'
'slave-02_controller_mongo'
'slave-03_controller_mongo'
'slave-04_compute_cinder'
'slave-05_compute_cinder'
'slave-06_ironic'

The bug description mentions only 4 nodes: "controller, compute, ironic, cinder, Telemetry - MongoDB"

Or is the description incorrect?

So I am marking the bug as Incomplete until it is reproduced on CI again or the correct snapshot is attached.

Changed in mos:
status: Confirmed → Incomplete
assignee: Sergey Nikitin (snikitin) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :
Changed in mos:
status: Incomplete → Confirmed
assignee: Sofiia Andriichenko (sandriichenko) → Sergey Nikitin (snikitin)
Revision history for this message
Sergey Nikitin (snikitin) wrote :

node-2/var/log/nova/nova-compute.log
Nova tries to create the port with this command:

sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo8e5a0737-a6 -- add-port br-int qvo8e5a0737-a6 -- set Interface qvo8e5a0737-a6 external-ids:iface-id=8e5a0737-a6d3-4d70-83c5-af9fb69e420d external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:19:4e:93 external-ids:vm-uuid=2bed27c1-70b6-4db7-b03a-fbcf34d51215

but the OVS command failed with a timeout after 2 minutes:
nailgun.test.domain.local/var/log/remote/node-2.test.domain.local/ovs-vsctl.log

2016-08-01T08:32:55.158210+00:00 notice: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=120 -- --if-exists del-port qvo28267e0a-5d -- add-port br-int qvo28267e0a-5d -- set Interface qvo28267e0a-5d external-ids:iface-id=28267e0a-5d1c-4795-aa03-461c3f93b319 external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:87:1e:a4 external-ids:vm-uuid=0208dfc4-6a9b-4f55-9650-83d625a9222b

2016-08-01T08:34:55.160326+00:00 warning:
ovs|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)

It's NOT a timeout waiting for an event from Neutron; it's a timeout in the execution of the ovs-vsctl command itself.
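
A quick way to confirm this pattern (a sketch using standard grep; the port name and log paths are the ones from this snapshot and will differ on other environments):

    # the failed ovs-vsctl invocation for the port
    grep "qvo28267e0a-5d" nailgun.test.domain.local/var/log/remote/node-2.test.domain.local/ovs-vsctl.log
    # the signature of an ovs-vsctl execution timeout (SIGALRM after --timeout expires)
    grep "Alarm clock" nailgun.test.domain.local/var/log/remote/node-2.test.domain.local/ovs-vsctl.log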

Changed in mos:
assignee: Sergey Nikitin (snikitin) → MOS Neutron (mos-neutron)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

This looks like some Open vSwitch issue, which is hard to debug without the openvswitch logs (there is no openvswitch/ dir for the compute nodes in the snapshot). We need a snapshot with the openvswitch logs (var/log/openvswitch/) or a live reproduction for further debugging.

What is the reproducibility of this bug?

Changed in mos:
status: Confirmed → Incomplete
assignee: MOS Neutron (mos-neutron) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Oleg Bondarev, that is how often this test fails. If this test fails again, I can give you access to the environment.

Changed in mos:
status: Incomplete → Confirmed
assignee: Sofiia Andriichenko (sandriichenko) → MOS Neutron (mos-neutron)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Let's keep it 'Incomplete' until the next failure, then.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Alexander Ignatov (aignatov) wrote :

Reassigning this bug to Sofiia since we are waiting for a live environment from the QA side.

Changed in mos:
assignee: MOS Neutron (mos-neutron) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Can't reproduce manually or on CI.

Changed in mos:
status: Incomplete → Invalid
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Reproduced manually and on CI
9.1 snapshot #294

Traceback (most recent call last):
  File "tempest/test.py", line 107, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_server_multinode.py", line 71, in test_schedule_to_all_nodes
    wait_until='ACTIVE')
  File "tempest/scenario/manager.py", line 205, in create_server
    image_id=image_id, **kwargs)
  File "tempest/common/compute.py", line 168, in create_test_server
    % server['id'])
  File "/root/.rally/tempest/for-deployment-c63cd82f-4352-4a78-99a3-e52392edb324/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/root/.rally/tempest/for-deployment-c63cd82f-4352-4a78-99a3-e52392edb324/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "tempest/common/compute.py", line 150, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "tempest/common/waiters.py", line 95, in wait_for_server_status
    raise exceptions.TimeoutException(message)
tempest.exceptions.TimeoutException: Request timed out
Details: (TestServerMultinode:test_schedule_to_all_nodes) Server d394f3e9-c4d7-48ca-bd3c-fc6e52cebc3f failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: scheduling.

Changed in mos:
status: Invalid → Confirmed
assignee: Sofiia Andriichenko (sandriichenko) → nobody
Revision history for this message
Oleg Bondarev (obondarev) wrote :

On the env where the issue was reproduced ^^ I see one compute node in the down state. Looking at the tempest test code, I see that it doesn't check the state of the nodes when spawning a VM on each node - most probably the instance gets stuck because it is scheduled to the down compute node.
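
A minimal check for this condition (a sketch assuming the standard nova CLI on a controller; hostnames are environment-specific):

    # list compute services and their state; any row with State = down is a candidate culprit
    nova service-list --binary nova-compute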

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Confirmed that the test fails due to the down compute node.

Changed in mos:
status: Confirmed → Invalid
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Reproduced on 9.1 snapshot #355
Trace:

Traceback (most recent call last):
  File "tempest/test.py", line 107, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_server_multinode.py", line 70, in test_schedule_to_all_nodes
    wait_until='ACTIVE')
  File "tempest/scenario/manager.py", line 205, in create_server
    image_id=image_id, **kwargs)
  File "tempest/common/compute.py", line 168, in create_test_server
    % server['id'])
  File "/root/.rally/tempest/for-deployment-ef1438d4-5231-4bae-83d1-117caa2049b6/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/root/.rally/tempest/for-deployment-ef1438d4-5231-4bae-83d1-117caa2049b6/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "tempest/common/compute.py", line 150, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "tempest/common/waiters.py", line 75, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server 4596f713-036c-4d30-ad8d-bc78195fea5e failed to build and is in ERROR status

Details: {u'message': u'OVS configuration failed with: Unexpected error while running command.\nCommand: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo0bf8be3f-73 -- add-port br-int qvo0bf8be3f-73 -- set Interface qvo0bf8be3f-73 ext', u'code': 500, u'details': u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\n filter_properties)\n File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2116, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'created': u'2016-10-07T02:24:12Z'

Snapshot:
https://drive.google.com/a/mirantis.com/file/d/0BxPLDs6wcpbDc2pIWnhPWl9tVUU/view?usp=sharing

Changed in mos:
status: Invalid → Confirmed
assignee: nobody → MOS Neutron (mos-neutron)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

A general problem with OVS on node-6 (compute) leads to VMs failing to spawn, DVR routers not functioning correctly, and overall connectivity problems of all kinds.

root@node-6:~# vi /var/log/openvswitch/ovs-vswitchd.log:

2016-10-07T02:05:30.501Z|00001|ovs_rcu(urcu3)|WARN|blocked 1000 ms waiting for main to quiesce
2016-10-07T02:05:31.501Z|00002|ovs_rcu(urcu3)|WARN|blocked 2000 ms waiting for main to quiesce
2016-10-07T02:05:33.501Z|00003|ovs_rcu(urcu3)|WARN|blocked 4000 ms waiting for main to quiesce
2016-10-07T02:05:37.501Z|00004|ovs_rcu(urcu3)|WARN|blocked 8000 ms waiting for main to quiesce
2016-10-07T02:05:45.501Z|00005|ovs_rcu(urcu3)|WARN|blocked 16000 ms waiting for main to quiesce
2016-10-07T02:06:01.502Z|00006|ovs_rcu(urcu3)|WARN|blocked 32001 ms waiting for main to quiesce
2016-10-07T02:06:33.501Z|00007|ovs_rcu(urcu3)|WARN|blocked 64000 ms waiting for main to quiesce
2016-10-07T02:07:37.501Z|00008|ovs_rcu(urcu3)|WARN|blocked 128000 ms waiting for main to quiesce
2016-10-07T02:09:45.501Z|00009|ovs_rcu(urcu3)|WARN|blocked 256000 ms waiting for main to quiesce
2016-10-07T02:14:01.501Z|00010|ovs_rcu(urcu3)|WARN|blocked 512000 ms waiting for main to quiesce
2016-10-07T02:22:33.501Z|00011|ovs_rcu(urcu3)|WARN|blocked 1024001 ms waiting for main to quiesce
2016-10-07T02:39:37.501Z|00012|ovs_rcu(urcu3)|WARN|blocked 2048000 ms waiting for main to quiesce
2016-10-07T03:13:45.501Z|00013|ovs_rcu(urcu3)|WARN|blocked 4096000 ms waiting for main to quiesce
2016-10-07T04:22:01.501Z|00014|ovs_rcu(urcu3)|WARN|blocked 8192000 ms waiting for main to quiesce
2016-10-07T06:38:33.501Z|00015|ovs_rcu(urcu3)|WARN|blocked 16384000 ms waiting for main to quiesce
2016-10-07T11:11:37.501Z|00016|ovs_rcu(urcu3)|WARN|blocked 32768000 ms waiting for main to quiesce

The OVS problem is still to be investigated.
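
A rough way to tell whether ovs-vswitchd on the node is wedged in this state (a sketch to run on the affected compute node):

    # a healthy vswitchd answers immediately; a stuck main thread makes this hang until the timeout fires
    ovs-vsctl --timeout=5 show
    # look for the growing "blocked ... waiting for main to quiesce" warnings
    tail -n 20 /var/log/openvswitch/ovs-vswitchd.log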

Revision history for this message
Oleg Bondarev (obondarev) wrote :

The OVS issue can be fixed by a restart: service openvswitch-switch restart.
Then neutron-ovs-agent needs to be restarted as well, and things get back to normal.
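
As a sketch of that workaround on the affected compute node (the agent service name depends on the package version, e.g. neutron-plugin-openvswitch-agent vs neutron-openvswitch-agent):

    service openvswitch-switch restart
    service neutron-plugin-openvswitch-agent restart    # or neutron-openvswitch-agent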

The OVS issue is some kind of race condition that occurs in rare circumstances.
Lowering to Medium and moving to 9.2.

Changed in mos:
importance: High → Medium
milestone: 9.1 → 9.2
Revision history for this message
Oleg Bondarev (obondarev) wrote :

9.2 should have OVS 2.5.1; we need to check whether the issue is still there with the new OVS version. Marking as Incomplete for now.

Changed in mos:
status: Confirmed → Incomplete
assignee: MOS Neutron (mos-neutron) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Hi Sofia, could you please take a look and update the status for 9.2?

Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Hi, this test now fails regularly, but with a different error.

Trace:

Traceback (most recent call last):
  File "tempest/test.py", line 100, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_server_multinode.py", line 78, in test_schedule_to_all_nodes
    wait_until='ACTIVE')
  File "tempest/scenario/manager.py", line 206, in create_server
    image_id=image_id, **kwargs)
  File "tempest/common/compute.py", line 167, in create_test_server
    % server['id'])
  File "/root/.rally/tempest/for-deployment-265362ef-e144-4b52-9571-4ea6c994547f/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/root/.rally/tempest/for-deployment-265362ef-e144-4b52-9571-4ea6c994547f/.venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "tempest/common/compute.py", line 149, in create_test_server
    clients.servers_client, server['id'], wait_until)
  File "tempest/common/waiters.py", line 75, in wait_for_server_status
    server_id=server_id)
tempest.exceptions.BuildErrorException: Server a022fa1a-5abf-4906-bb60-66a83fb70097 failed to build and is in ERROR status

Details: {u'message': u'No free port available for instance a022fa1a-5abf-4906-bb60-66a83fb70097.', u'code': 500, u'details': u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\n filter_properties)\n File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2116, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'created': u'2016-11-22T01:52:55Z'}

Changed in mos:
status: Incomplete → Confirmed
assignee: Sofiia Andriichenko (sandriichenko) → nobody
assignee: nobody → MOS Neutron (mos-neutron)
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Test failed with trace: http://paste.openstack.org/show/590094/

nova-compute log: http://paste.openstack.org/show/590096/

From the log, we can see that Nova tries to boot the instance on the Ironic node, which uses the fake driver.
If ironic-compute is disabled, the test passes (nova service-disable <hostname> <binary>): http://paste.openstack.org/show/590097/
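
A concrete (hypothetical) form of that workaround, with a placeholder hostname:

    # disable the nova-compute service that fronts the fake Ironic node so the scheduler skips it
    nova service-disable <ironic-compute-hostname> nova-compute
    # re-enable it once the Ironic node is properly enrolled or removed
    nova service-enable <ironic-compute-hostname> nova-compute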

Changed in mos:
assignee: MOS Neutron (mos-neutron) → MOS Nova (mos-nova)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Sofiia,

I checked the logs and the Nova code, and this exception is raised when the node an instance is scheduled to has no MAC addresses available. Ironic is special in this case: we can't simply allocate any port in Neutron - we must use the ports available on this specific node. Which brings me to my question: how do you enroll an Ironic node to be provisioned? Are you sure you also register ports on this node via "ironic port-create" (http://developer.openstack.org/api-ref/baremetal/?expanded=create-port-detail)?
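
For reference, a hedged sketch of registering a port for an already-enrolled node with the ironic CLI of that era (the node UUID and MAC address are placeholders):

    # register the node's real MAC so Nova/Neutron can bind a port on that specific node
    ironic port-create -n <node-uuid> -a 52:54:00:aa:bb:cc
    # list the ports registered for the node
    ironic node-port-list <node-uuid>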

Changed in mos:
assignee: MOS Nova (mos-nova) → Sofiia Andriichenko (sandriichenko)
status: Confirmed → Incomplete
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Roman,

We don't have a port for this node because it is a fake node and we can't boot an instance on it. This node was created only for basic Ironic tests.

When I deleted the node, the error changed: http://paste.openstack.org/show/590312/

Changed in mos:
assignee: Sofiia Andriichenko (sandriichenko) → MOS Nova (mos-nova)
status: Incomplete → Confirmed
Revision history for this message
Vasyl Saienko (vsaienko) wrote :

The environment was incorrectly prepared. We should enroll Ironic nodes before trying to spawn instances on baremetal; otherwise the tests will fail, and that is expected behaviour.

Changed in mos:
status: Confirmed → Won't Fix
status: Won't Fix → Invalid
Changed in mos:
status: Invalid → Confirmed
assignee: MOS Nova (mos-nova) → Oleksiy Butenko (obutenko)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Only one tempest test is affected; Oleksiy is going to check the configuration of the cloud to make sure it is correct.

Changed in mos:
assignee: Oleksiy Butenko (obutenko) → nobody