Octavia HealthMonitorScenarioTest is failing with "Load Balancer is immutable and cannot be updated"

Bug #1934994 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest is failing on the standalone scenario010 job in the network component line [1]:

https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

```
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTP_healthmonitor_CRUD [43.107531s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_healthmonitor.py", line 60, in test_LC_HTTP_healthmonitor_CRUD
        const.HEALTH_MONITOR_HTTP)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_healthmonitor.py", line 384, in _test_healthmonitor_CRUD
        CONF.load_balancer.check_timeout)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/waiters.py", line 79, in wait_for_status
        raise exceptions.UnexpectedResponseCode(message)
    tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received
    Details: (HealthMonitorScenarioTest:test_LC_HTTP_healthmonitor_CRUD) show_loadbalancer provisioning_status updated to an invalid state of ERROR
```
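
For context, the tempest waiter that raises here is essentially a status-polling loop: it repeatedly calls show_loadbalancer and fails immediately if provisioning_status reaches ERROR, or with a timeout error if the expected status never appears. A minimal sketch of that logic (illustrative only, not the actual octavia_tempest_plugin implementation; show_func, the exception types and the timeouts are placeholders):

```python
import time


def wait_for_status(show_func, object_id, expected_status,
                    check_timeout=60, check_interval=5):
    """Simplified sketch of a provisioning_status waiter."""
    deadline = time.time() + check_timeout
    while time.time() < deadline:
        status = show_func(object_id)['provisioning_status']
        if status == expected_status:
            return
        if status == 'ERROR':
            # The branch hit above: the load balancer went to ERROR,
            # so the test fails immediately instead of waiting.
            raise RuntimeError(
                'provisioning_status updated to an invalid state of ERROR')
        time.sleep(check_interval)
    # The branch seen in the downstream timeout variant discussed later:
    # the object never reached the expected status in time.
    raise TimeoutError(
        'provisioning_status failed to update to %s within %ss'
        % (expected_status, check_timeout))
```

In this failure the ERROR branch was taken because the amphora failover could not build a replacement compute instance, as the health-manager log below shows.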

Digging further into the Octavia health manager logs [2], the instance failed to build because no valid host was found:
https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/octavia/health-manager.log.txt.gz

```
|__Atom 'octavia-failover-loadbalancer-flow-octavia-failover-loadbalancer-flow-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer_id': 'c4d18c96-c609-443a-87af-c7ac8ce74d3d'}, 'provides': '4876f14b-16ff-4d85-8d9c-92f4ce1ee99b'}
                 |__Flow 'octavia-failover-loadbalancer-flow-octavia-failover-loadbalancer-flow-octavia-create-amp-for-lb-subflow'
                    |__Flow 'octavia-failover-loadbalancer-flow-octavia-create-amp-for-failover-subflow'
                       |__Atom 'octavia.controller.worker.v1.tasks.network_tasks.GetVIPSecurityGroupID' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer_id': 'c4d18c96-c609-443a-87af-c7ac8ce74d3d'}, 'provides': 'afe72b59-03ae-4363-aaab-243f9b8f72dc'}
                          |__Atom 'octavia.controller.worker.v1.tasks.database_tasks.MarkAmphoraHealthBusy' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                             |__Atom 'octavia.controller.worker.v1.tasks.database_tasks.MarkAmphoraPendingDeleteInDB' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                                |__Atom 'octavia.controller.worker.v1.tasks.lifecycle_tasks.AmphoraToErrorOnRevertTask' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                                   |__Flow 'octavia-failover-amphora-flow': octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2021-07-07T13:38:39Z', 'message': 'No valid host was found. ', 'details': 'Traceback (most recent call last):\n File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1519, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 881, in _schedule_instances\n return_alternates=return_alternates)\n File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call\n transport_options=self.transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send\n transport_options=transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 682, in send\n transport_options=transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 672, in _send\n raise result\nnova.exception_Remote.NoValidHost_Remote: No valid host was found. \nTraceback (most recent call last):\n\n File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 241, in inner\n return func(*args, **kwargs)\n\n File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 190, in select_destinations\n raise exception.NoValidHost(reason="")\n\nnova.exception.NoValidHost: No valid host was found. \n\n'}
```

From the nova-conductor logs [3]:
```
2021-07-07 13:38:38.266 7 ERROR nova.conductor.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 190, in select_destinations
    raise exception.NoValidHost(reason="")

nova.exception.NoValidHost: No valid host was found.
```

Until we find what is causing it, we are moving the tests to the skip list.
Links:
[1]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
[2]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/octavia/health-manager.log.txt.gz
[3]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/nova/nova-conductor.log.txt.gz

Revision history for this message
chandan kumar (chkumar246) wrote :

Since this job is not part of the component promotion criteria, I am adding this bug as a promotion blocker to get traction and get it fixed.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

The "no valid host" was because the placement service did not return any hosts.

That probably means the VM hosts were full, but we can't verify that without running placement in debug mode.

Currently we don't have the detailed placement debug logs to see why it got no hosts, but so far I am not seeing any bug with Nova. This just looks like it legitimately got no hosts, so if it's intermittent it could be a concurrency issue with multiple tests, or some other infra issue.

Can you update the job to run placement with debug logs?

Setting this to Incomplete since there is not enough data in the logs to triage this currently.

Changed in tripleo:
status: Triaged → Incomplete
Revision history for this message
sean mooney (sean-k-mooney) wrote :

https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/nova/nova-scheduler.log.txt.gz

```
2021-07-07 13:38:36.807 8 DEBUG nova.scheduler.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Starting to schedule for instances: ['f7d44e20-e47f-40ad-aa1d-713b357658c2'] select_destinations /usr/lib/python3.6/site-packages/nova/scheduler/manager.py:124
2021-07-07 13:38:36.817 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] require_image_type_support request filter added required trait COMPUTE_IMAGE_TYPE_RAW require_image_type_support /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:195
2021-07-07 13:38:36.818 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'require_image_type_support' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:36.819 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] compute_status_filter request filter added forbidden trait COMPUTE_STATUS_DISABLED compute_status_filter /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:253
2021-07-07 13:38:36.819 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'compute_status_filter' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:36.820 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'accelerators_filter' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:38.060 8 INFO nova.scheduler.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Got no allocation candidates from the Placement API. This could be due to insufficient resources or a temporary occurrence as compute nodes start up.
```
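
As a sanity check, one way to manually confirm whether Placement had any capacity at that point is to query GET /allocation_candidates for the resources the amphora flavor needs. A rough sketch using keystoneauth1 (the endpoint, credentials and resource amounts below are placeholders, not values taken from this job):

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session

# Placeholder admin credentials/endpoint; on the standalone deployment
# these would come from the generated clouds.yaml/stackrc.
auth = v3.Password(auth_url='http://192.0.2.1:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_name='Default',
                   project_domain_name='Default')
sess = session.Session(auth=auth)

# GET /allocation_candidates requires placement microversion >= 1.10.
resp = sess.get('/allocation_candidates',
                endpoint_filter={'service_type': 'placement'},
                headers={'OpenStack-API-Version': 'placement 1.10'},
                params={'resources': 'VCPU:1,MEMORY_MB:1024,DISK_GB:3'})

# An empty allocation_requests list corresponds to the "Got no
# allocation candidates" message in the scheduler log above.
print(len(resp.json().get('allocation_requests', [])))
```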

Revision history for this message
chandan kumar (chkumar246) wrote :

Proposed the testproject change https://review.rdoproject.org/r/c/testproject/+/34597 to get the debug placement logs.

Revision history for this message
chandan kumar (chkumar246) wrote :

On testproject change https://review.rdoproject.org/r/c/testproject/+/34597, with the skip-list revert https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316 applied, the jobs are passing:

periodic-tripleo-ci-centos-8-scenario010-standalone-network-master SUCCESS 2h 36m 13s

and

https://logserver.rdoproject.org/97/34597/1/check/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/30064c2/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
```
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTPS_healthmonitor_CRUD [22.345464s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTP_healthmonitor_CRUD [22.511256s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_PING_healthmonitor_CRUD [22.500439s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_TCP_healthmonitor_CRUD [22.532029s] ... ok
{1} octavia_tempest_plugin.tests.scenario.v2.test_listener.ListenerScenarioTest.test_http_least_connections_listener_CRUD [87.123008s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_TLS_healthmonitor_CRUD [23.079007s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_UDP_healthmonitor_CRUD [18.756871s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_HTTPS_healthmonitor_CRUD [24.555489s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_HTTP_healthmonitor_CRUD [24.116514s] ... ok
{1} octavia_tempest_plugin.tests.scenario.v2.test_listener.ListenerScenarioTest.test_http_round_robin_listener_CRUD [91.261516s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_PING_healthmonitor_CRUD [24.063673s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_TCP_healthmonitor_CRUD [19.832414s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_TLS_healthmonitor_CRUD [24.248578s] ... ok
```

I think we can merge https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316.

Thank you @Sean for looking into this. Here are the debug placement logs as well: https://logserver.rdoproject.org/97/34597/1/check/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/30064c2/logs/undercloud/var/log/containers/placement/placement.log.txt.gz Please have a look and see if there is anything we can tweak in this job.

Revision history for this message
Ronelle Landy (rlandy) wrote :
Changed in tripleo:
milestone: xena-2 → xena-3
Revision history for this message
Marios Andreou (marios-b) wrote :

To summarize... the original issue here was filed for the job at [1]. Based on this bug we skipped the tests and then re-added them with [2], since the job was back to green (was the original issue seen just once? I only see one job log referenced in the description).

Then we started seeing the same test fail in the downstream rhos-17 job [3], and indeed the latest example there is from yesterday, 22/07 [4].

I am not sure this is the same issue though.

The upstream trace looks like this (error):

Details: (HealthMonitorScenarioTest:test_LC_HTTP_healthmonitor_CRUD) show_loadbalancer provisioning_status updated to an invalid state of ERROR

The downstream trace looks like this (timeout):

Details: (HealthMonitorScenarioTest:setUpClass) show_loadbalancer provisioning_status failed to update to ACTIVE within the required time 900. Current status of show_loadbalancer: PENDING_CREATE

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-scenario010-standalone-network-master
[2] https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316
[3] https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-scenario010-standalone-network-rhos-17
[4] https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario010-standalone-network-rhos-17/70f8a6e/logs/undercloud/var/log/tempest/stestr_results.html

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

Following comment #8 above, we already have a tracker for the downstream issue at https://bugzilla.redhat.com/show_bug.cgi?id=1980528, so moving this one to Invalid.

Please move back if you disagree.

[EDIT] The bug is already in status Incomplete, so leaving it as is.
