Octavia HealthMonitorScenarioTest is failing with "Load Balancer is immutable and cannot be updated"

Bug #1934994 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest is failing on the standalone scenario010 job in the network component line [1]:

https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

```
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTP_healthmonitor_CRUD [43.107531s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_healthmonitor.py", line 60, in test_LC_HTTP_healthmonitor_CRUD
        const.HEALTH_MONITOR_HTTP)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/scenario/v2/test_healthmonitor.py", line 384, in _test_healthmonitor_CRUD
        CONF.load_balancer.check_timeout)
      File "/usr/lib/python3.6/site-packages/octavia_tempest_plugin/tests/waiters.py", line 79, in wait_for_status
        raise exceptions.UnexpectedResponseCode(message)
    tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received
    Details: (HealthMonitorScenarioTest:test_LC_HTTP_healthmonitor_CRUD) show_loadbalancer provisioning_status updated to an invalid state of ERROR
```
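
For context, the tempest waiter that raises here is essentially a status-polling loop: it repeatedly calls show_loadbalancer and fails immediately if provisioning_status reaches ERROR, or with a timeout error if the expected status never appears. A minimal sketch of that logic (illustrative only, not the actual octavia_tempest_plugin implementation; show_func, the exception types and the timeouts are placeholders):

```python
import time


def wait_for_status(show_func, object_id, expected_status,
                    check_timeout=60, check_interval=5):
    """Simplified sketch of a provisioning_status waiter."""
    deadline = time.time() + check_timeout
    while time.time() < deadline:
        status = show_func(object_id)['provisioning_status']
        if status == expected_status:
            return
        if status == 'ERROR':
            # The branch hit above: the load balancer went to ERROR,
            # so the test fails immediately instead of waiting.
            raise RuntimeError(
                'provisioning_status updated to an invalid state of ERROR')
        time.sleep(check_interval)
    # The branch seen in the downstream timeout variant discussed later:
    # the object never reached the expected status in time.
    raise TimeoutError(
        'provisioning_status failed to update to %s within %ss'
        % (expected_status, check_timeout))
```

In this failure the ERROR branch was taken because the amphora failover could not build a replacement compute instance, as the health-manager log below shows.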

Digging further into the Octavia health manager logs [2], the instance failed to build because no valid host was found:
https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/octavia/health-manager.log.txt.gz

```
|__Atom 'octavia-failover-loadbalancer-flow-octavia-failover-loadbalancer-flow-octavia-create-amp-for-lb-subflow-octavia-create-amphora-indb' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer_id': 'c4d18c96-c609-443a-87af-c7ac8ce74d3d'}, 'provides': '4876f14b-16ff-4d85-8d9c-92f4ce1ee99b'}
                 |__Flow 'octavia-failover-loadbalancer-flow-octavia-failover-loadbalancer-flow-octavia-create-amp-for-lb-subflow'
                    |__Flow 'octavia-failover-loadbalancer-flow-octavia-create-amp-for-failover-subflow'
                       |__Atom 'octavia.controller.worker.v1.tasks.network_tasks.GetVIPSecurityGroupID' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer_id': 'c4d18c96-c609-443a-87af-c7ac8ce74d3d'}, 'provides': 'afe72b59-03ae-4363-aaab-243f9b8f72dc'}
                          |__Atom 'octavia.controller.worker.v1.tasks.database_tasks.MarkAmphoraHealthBusy' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                             |__Atom 'octavia.controller.worker.v1.tasks.database_tasks.MarkAmphoraPendingDeleteInDB' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                                |__Atom 'octavia.controller.worker.v1.tasks.lifecycle_tasks.AmphoraToErrorOnRevertTask' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'amphora': <octavia.common.data_models.Amphora object at 0x7f2eaeaabf60>}, 'provides': None}
                                   |__Flow 'octavia-failover-amphora-flow': octavia.common.exceptions.ComputeBuildException: Failed to build compute instance due to: {'code': 500, 'created': '2021-07-07T13:38:39Z', 'message': 'No valid host was found. ', 'details': 'Traceback (most recent call last):\n File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1519, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 881, in _schedule_instances\n return_alternates=return_alternates)\n File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call\n transport_options=self.transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send\n transport_options=transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 682, in send\n transport_options=transport_options)\n File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 672, in _send\n raise result\nnova.exception_Remote.NoValidHost_Remote: No valid host was found. \nTraceback (most recent call last):\n\n File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 241, in inner\n return func(*args, **kwargs)\n\n File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 190, in select_destinations\n raise exception.NoValidHost(reason="")\n\nnova.exception.NoValidHost: No valid host was found. \n\n'}
```

From the nova-conductor logs [3]:
```
2021-07-07 13:38:38.266 7 ERROR nova.conductor.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python3.6/site-packages/nova/scheduler/manager.py", line 190, in select_destinations
    raise exception.NoValidHost(reason="")

nova.exception.NoValidHost: No valid host was found.
```

Until we find what is causing it, we are moving the tests to the skip list.
Links:
[1]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
[2]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/octavia/health-manager.log.txt.gz
[3]. https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/nova/nova-conductor.log.txt.gz

Revision history for this message
chandan kumar (chkumar246) wrote :

Since this job is not part of the component promotion criteria, I am adding this bug as a promotion blocker to get traction and get it fixed.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

The "no valid host" was because the placement service did not return any hosts.

That probably means the VM hosts were full, but we can't verify that without running placement in debug mode.

Currently we don't have the detailed placement debug logs to see why it got no hosts, but so far I am not seeing any bug with Nova. This just looks like it legitimately got no hosts, so if it's intermittent it could be a concurrency issue with multiple tests, or some other infra issue.

Can you update the job to run placement with debug logs?

Setting this to Incomplete since there is not enough data in the logs to triage this currently.

Changed in tripleo:
status: Triaged → Incomplete
Revision history for this message
sean mooney (sean-k-mooney) wrote :

https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/addb439/logs/undercloud/var/log/containers/nova/nova-scheduler.log.txt.gz

```
2021-07-07 13:38:36.807 8 DEBUG nova.scheduler.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Starting to schedule for instances: ['f7d44e20-e47f-40ad-aa1d-713b357658c2'] select_destinations /usr/lib/python3.6/site-packages/nova/scheduler/manager.py:124
2021-07-07 13:38:36.817 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] require_image_type_support request filter added required trait COMPUTE_IMAGE_TYPE_RAW require_image_type_support /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:195
2021-07-07 13:38:36.818 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'require_image_type_support' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:36.819 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] compute_status_filter request filter added forbidden trait COMPUTE_STATUS_DISABLED compute_status_filter /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:253
2021-07-07 13:38:36.819 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'compute_status_filter' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:36.820 8 DEBUG nova.scheduler.request_filter [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Request filter 'accelerators_filter' took 0.0 seconds wrapper /usr/lib/python3.6/site-packages/nova/scheduler/request_filter.py:47
2021-07-07 13:38:38.060 8 INFO nova.scheduler.manager [req-f37eed4b-c1f3-4ec6-a221-d7abd376cb31 dd2a84a271ad47febef78d3330df7f5a d7c948e99c09444fbd0b936358951c69 - default default] Got no allocation candidates from the Placement API. This could be due to insufficient resources or a temporary occurrence as compute nodes start up.
```
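
As a sanity check, one way to manually confirm whether Placement had any capacity at that point is to query GET /allocation_candidates for the resources the amphora flavor needs. A rough sketch using keystoneauth1 (the endpoint, credentials and resource amounts below are placeholders, not values taken from this job):

```python
from keystoneauth1.identity import v3
from keystoneauth1 import session

# Placeholder admin credentials/endpoint; on the standalone deployment
# these would come from the generated clouds.yaml/stackrc.
auth = v3.Password(auth_url='http://192.0.2.1:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_name='Default',
                   project_domain_name='Default')
sess = session.Session(auth=auth)

# GET /allocation_candidates requires placement microversion >= 1.10.
resp = sess.get('/allocation_candidates',
                endpoint_filter={'service_type': 'placement'},
                headers={'OpenStack-API-Version': 'placement 1.10'},
                params={'resources': 'VCPU:1,MEMORY_MB:1024,DISK_GB:3'})

# An empty allocation_requests list corresponds to the "Got no
# allocation candidates" message in the scheduler log above.
print(len(resp.json().get('allocation_requests', [])))
```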

Revision history for this message
chandan kumar (chkumar246) wrote :

Proposed the testproject change https://review.rdoproject.org/r/c/testproject/+/34597 to get the debug placement logs.

Revision history for this message
chandan kumar (chkumar246) wrote :

On testproject change https://review.rdoproject.org/r/c/testproject/+/34597, with the skip-list revert https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316 applied, the jobs are passing:

periodic-tripleo-ci-centos-8-scenario010-standalone-network-master SUCCESS 2h 36m 13s

and

https://logserver.rdoproject.org/97/34597/1/check/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/30064c2/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz
```
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTPS_healthmonitor_CRUD [22.345464s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_HTTP_healthmonitor_CRUD [22.511256s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_PING_healthmonitor_CRUD [22.500439s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_TCP_healthmonitor_CRUD [22.532029s] ... ok
{1} octavia_tempest_plugin.tests.scenario.v2.test_listener.ListenerScenarioTest.test_http_least_connections_listener_CRUD [87.123008s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_TLS_healthmonitor_CRUD [23.079007s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_LC_UDP_healthmonitor_CRUD [18.756871s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_HTTPS_healthmonitor_CRUD [24.555489s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_HTTP_healthmonitor_CRUD [24.116514s] ... ok
{1} octavia_tempest_plugin.tests.scenario.v2.test_listener.ListenerScenarioTest.test_http_round_robin_listener_CRUD [91.261516s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_PING_healthmonitor_CRUD [24.063673s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_TCP_healthmonitor_CRUD [19.832414s] ... ok
{0} octavia_tempest_plugin.tests.scenario.v2.test_healthmonitor.HealthMonitorScenarioTest.test_RR_TLS_healthmonitor_CRUD [24.248578s] ... ok
```

I think we can merge https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316.

Thank you @Sean for looking into this. Here are the debug placement logs as well: https://logserver.rdoproject.org/97/34597/1/check/periodic-tripleo-ci-centos-8-scenario010-standalone-network-master/30064c2/logs/undercloud/var/log/containers/placement/placement.log.txt.gz Please have a look and see if there is anything we can tweak in this job.

Revision history for this message
Ronelle Landy (rlandy) wrote :
Changed in tripleo:
milestone: xena-2 → xena-3
Revision history for this message
Marios Andreou (marios-b) wrote :

To summarize... the original issue here was filed for the job at [1]. Based on this bug we skipped the tests and then re-added them with [2], since the job was back to green (was the original issue seen just once? I only see one job log referenced in the description).

Then we started seeing the same test fail in the downstream rhos-17 job [3], and indeed the latest example there is from yesterday, 22/07 [4].

I am not sure this is the same issue though.

The upstream trace looks like this (error):

Details: (HealthMonitorScenarioTest:test_LC_HTTP_healthmonitor_CRUD) show_loadbalancer provisioning_status updated to an invalid state of ERROR

The downstream trace looks like this (timeout):

Details: (HealthMonitorScenarioTest:setUpClass) show_loadbalancer provisioning_status failed to update to ACTIVE within the required time 900. Current status of show_loadbalancer: PENDING_CREATE

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-scenario010-standalone-network-master
[2] https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/801316
[3] https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-rhel-8-scenario010-standalone-network-rhos-17
[4] https://sf.hosted.upshift.rdu2.redhat.com/logs/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario010-standalone-network-rhos-17/70f8a6e/logs/undercloud/var/log/tempest/stestr_results.html

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

Following comment #8 above, we already have a tracker for the downstream issue at https://bugzilla.redhat.com/show_bug.cgi?id=1980528, so moving this one to Invalid.

Please move back if you disagree.

[EDIT] The bug is already in status Incomplete, so leaving it as is.
