test_execute_workload_stabilization intermittently fails because server is deleted before live migration is complete

Bug #1828598 reported by Matt Riedemann
Affects: watcher
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I noticed this here:

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/testr_results.html.gz

Live migration fails because the neutron port on the server is deleted while the server is being live migrated. It actually looks like the instance itself is being destroyed before the live migration is complete, which is what deletes the port.

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/compute1/logs/screen-n-cpu.txt.gz#_May_10_10_18_19_106717

May 10 10:18:19.106717 ubuntu-bionic-vexxhost-sjc1-0006079177 nova-compute[19758]: INFO nova.compute.manager [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Took 0.28 seconds to destroy the instance on the hypervisor.

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/compute1/logs/screen-n-cpu.txt.gz#_May_10_10_18_21_174978

May 10 10:18:21.174978 ubuntu-bionic-vexxhost-sjc1-0006079177 nova-compute[19758]: WARNING nova.virt.libvirt.driver [None req-4e88ad59-55d7-4cc2-9deb-532900c94ab6 None None] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Error monitoring migration: Failed to activate binding for port 67f1aaa7-8966-448b-9655-56e19a01fb62 and host ubuntu-bionic-vexxhost-sjc1-0006079176.: PortBindingActivationFailed: Failed to activate binding for port 67f1aaa7-8966-448b-9655-56e19a01fb62 and host ubuntu-bionic-vexxhost-sjc1-0006079176.

You can see the nova-api request to delete the server here:

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/controller/logs/screen-n-api.txt.gz#_May_10_10_18_18_867714

May 10 10:18:18.867714 ubuntu-bionic-vexxhost-sjc1-0006079176 <email address hidden>[6484]: INFO nova.api.openstack.requestlog [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] 38.108.68.233 "DELETE /compute/v2.1/servers/27813a13-39dc-490d-86f3-c1877c07a010" status: 204 len: 0 microversion: 2.1 time: 0.264338

The live migration of that server starts here:

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/controller/logs/screen-n-api.txt.gz#_May_10_10_17_58_008645

May 10 10:17:58.008645 ubuntu-bionic-vexxhost-sjc1-0006079176 <email address hidden>[6484]: DEBUG nova.compute.api [None req-908b9e12-08da-4f48-bbb2-f3b01c320d92 admin admin] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Going to try to live migrate instance to ubuntu-bionic-vexxhost-sjc1-0006079176 {{(pid=6485) live_migrate /opt/stack/nova/nova/compute/api.py:4540}}

So live migration starts at 10:17:58.008645, the server delete request is at 10:18:18.867714, the guest is deleted from the hypervisor by 10:18:19.106717, and live migration fails at 10:18:21.174978.

The test creates an audit template from the workload_stabilization strategy, creates an audit from the template, and then polls until the audit is finished - and it's the audit that would kick off the live migration, right? So I guess the race is in the workload_stabilization strategy not waiting for the live migration to complete?
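
For reference, this is roughly the flow the test goes through; the client method and field names below are hypothetical stand-ins, not the actual watcher-tempest-plugin API:

    # Rough sketch of the test flow; client method/field names are hypothetical.
    import time

    def run_workload_stabilization_audit(client, timeout=300, interval=5):
        # 1. Create an audit template bound to the workload_stabilization strategy.
        template = client.create_audit_template(
            goal="workload_balancing", strategy="workload_stabilization")
        # 2. Create a ONESHOT audit from that template.
        audit = client.create_audit(
            audit_template_uuid=template["uuid"], audit_type="ONESHOT")
        # 3. Poll until the audit reaches a terminal state.
        deadline = time.time() + timeout
        while time.time() < deadline:
            audit = client.show_audit(audit["uuid"])
            if audit["state"] in ("SUCCEEDED", "FAILED"):
                return audit
            time.sleep(interval)
        raise Exception("audit %s did not finish in time" % audit["uuid"])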

Looks like at this point the audit status is SUCCEEDED:

http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/job-output.txt.gz#_2019-05-10_10_18_22_112059

2019-05-10 10:18:22.110668 | controller | 2019-05-10 10:18:09,489 12011 INFO [tempest.lib.common.rest_client] Request (TestExecuteWorkloadBalancingStrategy:test_execute_workload_stabilization): 200 GET http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443 0.134s
2019-05-10 10:18:22.110818 | controller | 2019-05-10 10:18:09,494 12011 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
2019-05-10 10:18:22.110854 | controller | Body: None
2019-05-10 10:18:22.111182 | controller | Response - Headers: {u'connection': 'close', u'server': 'Apache/2.4.29 (Ubuntu)', u'vary': 'OpenStack-API-Version', u'content-length': '1137', u'openstack-api-maximum-version': '1.1', u'openstack-api-version': 'infra-optim 1.0', u'openstack-api-minimum-version': '1.0', 'status': '200', u'date': 'Fri, 10 May 2019 10:18:09 GMT', u'content-type': 'application/json', 'content-location': 'http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443'}
2019-05-10 10:18:22.112059 | controller | Body: {"auto_trigger": false, "next_run_time": null, "audit_type": "ONESHOT", "interval": null, "uuid": "97456a4e-a115-45b9-98fe-92c5795fe443", "parameters": {"thresholds": {"cpu_util": 0.2}, "granularity": 300, "metrics": ["cpu_util"], "retry_count": 1, "weights": {"cpu_util_weight": 1.0}, "periods": {"node": 60000, "instance": 72000}, "host_choice": "retry", "aggregation_method": {"node": "mean", "instance": "mean"}, "instance_metrics": {"cpu_util": "compute.node.cpu.percent"}}, "goal_uuid": "39be02da-7999-45e0-a134-c2c6c6049b39", "created_at": "2019-05-10T10:18:05+00:00", "links": [{"href": "http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443", "rel": "self"}, {"href": "http://38.108.68.233:9322/audits/97456a4e-a115-45b9-98fe-92c5795fe443", "rel": "bookmark"}], "goal_name": "workload_balancing", "strategy_name": "workload_stabilization", "updated_at": "2019-05-10T10:18:08+00:00", "state": "SUCCEEDED", "strategy_uuid": "e93ded8c-2503-4629-819a-8360332ff929", "scope": [], "deleted_at": null, "hostname": "ubuntu-bionic-vexxhost-sjc1-0006079176", "name": "workload_stabilization-2019-05-10T10:18:04.810340"}
2019-05-10 10:18:22.112176 | controller | 2019-05-10 10:18:09,494 12011 DEBUG [tempest.lib.common.utils.test_utils] Call partial returns true in 4.405191 seconds

I'm confused because I don't see this in the watcher logs:

https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L281

or when it's waiting for the live migration to complete:

https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L308

https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L330
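
For context, the code at those links triggers the live migration and then polls nova until the instance lands on the destination. A simplified, hypothetical sketch of that kind of wait loop (not the actual nova_helper implementation):

    # Hypothetical sketch of a live-migration wait loop using python-novaclient;
    # the real watcher nova_helper code differs in detail.
    import time

    def wait_for_live_migration(nova, server_id, dest_host, timeout=120, interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            server = nova.servers.get(server_id)
            host = getattr(server, "OS-EXT-SRV-ATTR:host", None)
            if server.status.upper() == "ACTIVE" and host == dest_host:
                return True   # migration finished on the destination host
            if server.status.upper() == "ERROR":
                return False  # migration (or something else) failed
            time.sleep(interval)
        return False          # timed out waiting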

It looks like the missing log output is because of this configuration:

May 10 10:06:34.673237 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-applier[3306]: DEBUG watcher.common.service [-] default_log_levels = ['amqp=WARN', 'amqplib=WARN', 'qpid.messaging=INFO', 'oslo.messaging=INFO', 'sqlalchemy=WARN', 'keystoneclient=INFO', 'stevedore=INFO', 'eventlet.wsgi.server=WARN', 'iso8601=WARN', 'requests=WARN', 'neutronclient=WARN', 'glanceclient=WARN', 'watcher.openstack.common=WARN', 'apscheduler=WARN'] {{(pid=3306) log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2577}}

'watcher.openstack.common=WARN' means I won't see any debug logs from that nova_helper code.
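
A small standalone snippet (hypothetical, just to illustrate the oslo.log behavior) shows the effect: a logger pinned via default_log_levels stays at WARN even when debug output is otherwise enabled.

    # Hypothetical demo of oslo.log default_log_levels suppressing debug output.
    import logging
    from oslo_config import cfg
    from oslo_log import log

    CONF = cfg.CONF
    log.register_options(CONF)
    log.set_defaults(default_log_levels=['watcher.openstack.common=WARN'])
    CONF(['--debug'], project='demo')
    log.setup(CONF, 'demo')

    LOG = logging.getLogger('watcher.openstack.common.nova_helper')
    LOG.debug('suppressed: the WARN default wins even with --debug')
    LOG.warning('this message still shows up')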

Tags: gate-failure
Matt Riedemann (mriedem)
tags: added: gate-failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to watcher (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/658399

Matt Riedemann (mriedem)
Changed in watcher:
status: New → Confirmed
Revision history for this message
licanwei (li-canwei2) wrote :

The log at http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/controller/logs/screen-watcher-decision-engine.txt.gz records all of the notification events from nova.
It seems devstack deleted the server before the migration finished.
Maybe we should wait until the migration is finished before doing other actions (see the sketch after the log excerpt below).

May 10 10:18:03.304398 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: INFO watcher.decision_engine.model.notification.nova [None req-908b9e12-08da-4f48-bbb2-f3b01c320d92 admin admin] Event 'instance.live_migration_pre.start' received from nova-compute:ubuntu-bionic-vexxhost-sjc1-0006079176 with metadata {'timestamp': u'2019-05-10 10:18:03.282237', 'message_id': u'90d034b0-d6c0-40b0-9160-e568769c40b5'}
May 10 10:18:18.869791 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: INFO watcher.decision_engine.model.notification.nova [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] Event 'instance.delete.start' received from nova-compute:ubuntu-bionic-vexxhost-sjc1-0006079177 with metadata {'timestamp': u'2019-05-10 10:18:18.812251', 'message_id': u'999a39cf-602a-404a-82c6-383d0328cb73'}
May 10 10:18:18.869791 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: DEBUG oslo_messaging._drivers.amqpdriver [-] received message with unique_id: 1a60bce026aa45ffacf7505275aecb31 {{(pid=2857) __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:283}}
May 10 10:18:18.869791 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: INFO watcher.decision_engine.model.notification.nova [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] Event 'instance.shutdown.start' received from nova-compute:ubuntu-bionic-vexxhost-sjc1-0006079177 with metadata {'timestamp': u'2019-05-10 10:18:18.819975', 'message_id': u'1cca30c7-dccf-4518-bf35-0ff41d9d9480'}
May 10 10:18:19.753430 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: DEBUG oslo_messaging._drivers.amqpdriver [-] received message with unique_id: 28a022d94598481a97195f1ef1eae5e6 {{(pid=2857) __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:283}}
May 10 10:18:19.755615 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: INFO watcher.decision_engine.model.notification.nova [None req-908b9e12-08da-4f48-bbb2-f3b01c320d92 admin admin] Event 'instance.live_migration_post.start' received from nova-compute:ubuntu-bionic-vexxhost-sjc1-0006079177 with metadata {'timestamp': u'2019-05-10 10:18:19.768263', 'message_id': u'0dc5bf77-22d9-4b1b-a23e-951c2cd1e2c4'}
May 10 10:18:21.635306 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-decision-engine[2857]: INFO watcher.decision_engine.model.notification.nova [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] Event 'instance.delete.end' received from nova-comput...
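
One way to do that on the test side would be to hold off the cleanup until nova reports the server is no longer migrating. A hypothetical sketch using python-novaclient (not the actual watcher-tempest-plugin cleanup code):

    # Hypothetical sketch: wait until the server is no longer migrating before
    # tearing it down.
    import time

    def wait_until_not_migrating(nova, server_id, timeout=300, interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            server = nova.servers.get(server_id)
            # OS-EXT-STS:task_state is visible to admin users.
            task_state = getattr(server, "OS-EXT-STS:task_state", None)
            if server.status.upper() != "MIGRATING" and task_state is None:
                return
            time.sleep(interval)
        raise Exception("server %s is still migrating after %ss"
                        % (server_id, timeout))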


Revision history for this message
Matt Riedemann (mriedem) wrote :

It's the tempest plugin / test that is deleting the server, right? As part of the test resource cleanup. The question is why the test is cleaning up the server before the live migration is complete. The test waits for the action plan to complete, and it reports the plan as successful, but the thing I couldn't figure out (due to the lack of debug logs in the nova_helper code in watcher) was at what point watcher thought the live migration was complete, and why - because that is what marks the action plan as successful, which in turn triggers the test to start tearing down.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to watcher (master)

Reviewed: https://review.opendev.org/658399
Committed: https://git.openstack.org/cgit/openstack/watcher/commit/?id=5c2939f23b8ebf23ccfd0a4dd41fadc9f344ac6e
Submitter: Zuul
Branch: master

commit 5c2939f23b8ebf23ccfd0a4dd41fadc9f344ac6e
Author: Matt Riedemann <email address hidden>
Date: Fri May 10 11:29:08 2019 -0400

    Remove watcher.openstack.common=WARN from _DEFAULT_LOG_LEVELS

    Hard-coding watcher.openstack.common to warning level logging
    only makes it hard to debug watcher's interactions with other
    services, like when it's triggering and monitoring a server live
    migration.

    Since debug logging is controlled via the "debug" configuration
    option, we can just rely on that to filter out debug logs within
    watcher itself.

    Note this has been this way since change
    I699e0ab082657880998d8618fe29eb7f56c6c661 back in 2015 and there
    was no explanation why the watcher.openstack.common logging
    was set to WARN level.

    Change-Id: I939403a4ae36d1aa9ea83badb9404bc37d18a1a6
    Related-Bug: #1828598
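
For illustration, the change amounts to dropping that entry from the module-level default list that is passed to oslo.log's set_defaults; a hypothetical sketch (the actual watcher/common/service.py wiring differs in detail):

    # Hypothetical sketch of the default log-level wiring; not the actual
    # watcher/common/service.py code.
    from oslo_log import log

    _DEFAULT_LOG_LEVELS = [
        'amqp=WARN', 'amqplib=WARN', 'qpid.messaging=INFO', 'oslo.messaging=INFO',
        'sqlalchemy=WARN', 'keystoneclient=INFO', 'stevedore=INFO',
        'eventlet.wsgi.server=WARN', 'iso8601=WARN', 'requests=WARN',
        'neutronclient=WARN', 'glanceclient=WARN', 'apscheduler=WARN',
        # 'watcher.openstack.common=WARN' removed so watcher's own debug logs
        # are controlled only by the "debug" option.
    ]

    log.set_defaults(default_log_levels=_DEFAULT_LOG_LEVELS)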
