"openstack server list" cmd hangs after standby controller reboot

Bug #1833730 reported by Peng Peng
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
In a DX (two-node) system, after a "sudo reboot -f" of the standby controller, the "openstack server list" command stops working.

Severity
--------
Major

Steps to Reproduce
------------------
1. On a two-node (AIO-DX) system, force a reboot of the standby controller with "sudo reboot -f".
2. On the active controller, run "openstack server list".

TC-name: test_evacuate.py::TestTisGuest::test_evacuate_vms

Expected Behavior
------------------
"openstack server list" cmd should keep working when rebooting standby controller

Actual Behavior
----------------
The command hangs; in the original run it did not return for 10 minutes.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190621T013000Z

Last Pass
---------
Lab: WCP_71_75
Load: 20190620T013000Z

Timestamp/Logs
--------------

[2019-06-21 12:55:53,533] 423 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-06-21 12:55:53,533] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 12:55:55,959] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 3c06d27a-f7e0-4184-bfed-cb4e4db6b15b | tenant1-image_vol-14 | ACTIVE | tenant2-mgmt-net=192.168.189.52; tenant2-net2=172.18.2.238 | tis-centos-guest | flv_nolocaldisk |
| 5235d268-6839-49a1-8b19-799bc00a0674 | tenant1-image_novol-13 | ACTIVE | tenant2-mgmt-net=192.168.189.56; tenant2-net2=172.18.2.135 | tis-centos-guest | flv_nolocaldisk |
| 9b9aa4a3-9eef-473d-be3f-836ee2eafa5b | tenant1-vol_local-12 | ACTIVE | tenant2-mgmt-net=192.168.189.39; tenant2-net2=172.18.2.151 | | flv_localdisk |
| 9efe809a-9ae4-4326-aba7-59d7674faa29 | tenant1-vol_nolocal-11 | ACTIVE | tenant2-mgmt-net=192.168.189.42; tenant2-net2=172.18.2.240 | | flv_nolocaldisk |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
controller-1:~$

[2019-06-21 12:58:01,981] 137 INFO MainThread host_helper.reboot_hosts:: Rebooting controller-0
[2019-06-21 12:58:01,981] 268 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-21 12:58:53,140] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:08:53,248] 358 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.

[2019-06-21 13:12:12,854] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-06-21 13:12:15,490] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-06-21 13:15:45,834] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:15:48,301] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 3c06d27a-f7e0-4184-bfed-cb4e4db6b15b | tenant1-image_vol-14 | ACTIVE | tenant2-mgmt-net=192.168.189.52; tenant2-net2=172.18.2.238 | tis-centos-guest | flv_nolocaldisk |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

The attached tar file is only for controller-0. Please provide the controller-1 logs as well.

Revision history for this message
Peng Peng (ppeng) wrote :

The controller-1 logs were not collected.

The automation log shows that the "server list" command hung for 10 minutes.

[2019-06-21 12:58:53,140] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:08:53,248] 358 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.
[2019-06-21 13:08:53,249] 691 DEBUG MainThread ssh.send_control:: Sending ctrl+c
[2019-06-21 13:08:53,336] 387 DEBUG MainThread ssh.expect :: Output:
^CTraceback (most recent call last):
  File "/usr/bin/openstack", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 210, in main
    return OpenStackShell().run(argv)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 116, in run
    column_names, data = self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/openstackclient/compute/v2/server.py", line 1275, in take_action
    limit=parsed_args.limit)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 892, in list
    "servers")
  File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 254, in _list
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 375, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 72, in request
    **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 534, in request
    resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 237, in request
    return self.session.request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 835, in request
    resp = send(**kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 926, in _send_request
    resp = self.session.request(method, url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/pyth...


Revision history for this message
Frank Miller (sensfan22) wrote :

The suspicion is that this issue results from only one nova-api-proxy pod running in the system; it was likely running on the standby controller when it was rebooted. Logs from controller-1 would be required to confirm.

Peng, please run a test where nova-api-proxy is running on the standby controller, reboot the standby controller, and determine how long it takes before the "openstack server list" command starts working again. This will tell us how long it takes for the nova-api-proxy pod to be failed over and come up on the active controller.

summary: - "openstack server list" cmd not working after standby controller reboot
+ "openstack server list" cmd hangs after standby controller reboot
Revision history for this message
Peng Peng (ppeng) wrote :

Test pre-condition:
- ensure nova-api-proxy is running on the standby controller;
- verify that the "openstack server list" command executes properly;
- reboot the standby controller.

After the standby controller reboot, the "openstack server list" command was run every minute in multiple terminals.
Test result:
0 min: command returned: Internal Server Error (HTTP 500)
1 min: command hung
2 min: still hung
3 min: still hung
3.5 min: returned "Connection timed out"

The same scenario occurred in a different session: the command hung for about 2.5 minutes, then returned a timeout.

After 6-7 minutes the standby controller booted up, but the command still hung.

Around 12 minutes after the reboot, command execution was back to normal.
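Peng's procedure above (rerun the CLI until it recovers and note the elapsed time) can be sketched as a small polling helper. This is an illustrative sketch, not part of the original automation; the function name, timings, and return convention are assumptions:

```python
import subprocess
import time

def wait_for_recovery(cmd, timeout=900, interval=60, cmd_timeout=240):
    """Poll `cmd` until it exits 0; return the elapsed outage in seconds.

    A hung command (no exit within `cmd_timeout`) is treated the same as
    a failure. Returns None if `cmd` never recovers within `timeout`.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            result = subprocess.run(cmd, capture_output=True,
                                    timeout=cmd_timeout)
            if result.returncode == 0:
                return time.monotonic() - start
        except subprocess.TimeoutExpired:
            pass  # command hung; subprocess kills it, keep polling
        time.sleep(interval)
    return None

# Hypothetical usage: measure how long the CLI is down after the reboot.
# outage = wait_for_recovery(["openstack", "server", "list"])
```

In the run described above, such a measurement would have reported an outage of roughly 12 minutes.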

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was also seen in a regression test run on load 20190622T013000Z: 3 test cases failed due to this issue ("openstack hypervisor list").
Details: CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list' failed to execute. Output: ^CTraceback (most recent call last):
 E File "/usr/bin/openstack", line 10, in <module>

 raise exceptions.CLIRejected("CLI '{}' failed to execute. Output: {}".format(complete_cmd, cmd_output))
 E utils.exceptions.CLIRejected: CLI command is rejected.
 E Details: CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list' failed to execute. Output: ^CTraceback (most recent call last):
 E File "/usr/bin/openstack", line 10, in <module>
 E sys.exit(main())
 E File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 210, in main
 E return OpenStackShell().run(argv)
 E File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
 E ret_val = super(OpenStackShell, self).run(argv)
 E File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
 E result = self.run_subcommand(remainder)
 E File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
 E ret_value = super(OpenStackShell, self).run_subcommand(argv)
 E File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
 E result = cmd.run(parsed_args)
 E File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
 E return super(Command, self).run(parsed_args)
 E File "/usr/lib/python2.7/site-packages/cliff/display.py", line 116, in run
 E column_names, data = self.take_action(parsed_args)
 E File "/usr/lib/python2.7/site-packages/openstackclient/compute/v2/hypervisor.py", line 60, in take_action
 E data = compute_client.hypervisors.list()
 E File "/usr/lib/python2.7/site-packages/novaclient/api_versions.py", line 393, in substitution
 E return methods[-1].func(obj, *args, **kwargs)
 E File "/usr/lib/python2.7/site-packages/novaclient/v2/hypervisors.py", line 59, in list
 E return self._list_base(detailed=detailed)
 E File "/usr/lib/python2.7/site-packages/novaclient/v2/hypervisors.py", line 50, in _list_base
 E return self._list(path, 'hypervisors')
 E File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 254, in _list
 E resp, body = self.api.client.get(url)
 E File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 375, in get
 E return self.request(url, 'GET', **kwargs)
 E Fi...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating -- limitation in the current implementation where nova-api-proxy is only running on one controller. This results in delayed recovery (up to 12mins) if the controller w/ nova-api-proxy goes down.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Revision history for this message
Chris Friesen (cbf123) wrote :

The 12 minutes is odd, even given that nova-api-proxy is on the standby. I would not expect it to take that long to recover.

Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the logs, controller-0 came up at 2019-06-21T12:59:18. We see mariadb coming up initially at 13:07:41 but it fails to join the DB cluster and aborts at 13:07:51. We see another attempt at 13:08:44 and it is synced with the DB cluster at 13:09:08.

This doesn't explain why the other mariadb server isn't handling the requests though, and we need the logs from the other controller node to answer that.

Revision history for this message
Tao Liu (tliu88) wrote :

I suggest a quick test to evaluate the active-active nova-api-proxy deployment prior to making the override changes.

After the stx-openstack application is applied:
1. kubectl scale deployment nova-api-proxy -n openstack --replicas=2
2. Tail logs for both pods
kubectl logs -f -n openstack <nova-api-proxy-pod1>
kubectl logs -f -n openstack <nova-api-proxy-pod2>
3. Create a VM and perform some VM stop/start/pause actions, and ensure each request is processed by one pod

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Cindy, can you please help by assigning this to a member of your team? Thanks.

Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Cindy Xie (xxie1)
Revision history for this message
Dariush Eslimi (deslimi) wrote :

Bart will take over, as it seems it is more than just implementing active-active proxy.

Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

I have done some testing in the WP_1-2 lab where the issue was originally raised, with a load from July 16th. I am not able to reproduce the 12 minute "openstack server list" outage time. This is likely due to several fixes that have gone in since this LP was raised, including a fix to reduce the amount of time before pods are evicted when a node becomes unavailable.

However, I do see that the "openstack server list" command still fails for over a minute when the standby controller is rebooted and the nova-api-proxy pod is running on that controller. I did some testing with 2 replicas of nova-api-proxy (as per Tao's suggestion above) and this improves the time significantly: the "openstack server list" command now fails for about 30 seconds when the standby controller is rebooted. Note that there will always be some failure time when a standby AIO-DX controller is rebooted, because MariaDB goes down on any AIO-DX controller reboot, due to the way we handle redundancy for MariaDB.

I am going to use this LP to make the change to replicate the nova-api-proxy.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672544

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672544
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=6790b7e7c7b7742170554c5b219a02abbf6fcd16
Submitter: Zuul
Branch: master

commit 6790b7e7c7b7742170554c5b219a02abbf6fcd16
Author: Bart Wensley <email address hidden>
Date: Fri Jul 19 14:14:28 2019 -0500

    Replicate nova-api-proxy pod

    The nova-api-proxy is currently running in a single pod on one of
    the controllers. To improve recovery time when a controller
    fails, the nova-api-proxy pod will now be run with replicas
    set to two and anti-affinity configured so there is a pod on
    each controller.

    Closes-bug: 1833730
    Change-Id: Iacd17251b86050e337d9a0f832b9dfa6e9864fce
    Signed-off-by: Bart Wensley <email address hidden>
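The commit above runs nova-api-proxy with two replicas and pod anti-affinity. As a rough illustration only, such a change typically looks like the following Helm override sketch; the key names here are assumptions modeled on openstack-helm conventions, not the literal contents of the review:

```yaml
# Illustrative override sketch (key names are assumptions, not the
# literal starlingx/config change -- see the linked review for that).
pod:
  replicas:
    proxy: 2                # one nova-api-proxy pod per controller
  affinity:
    anti:
      type:
        default: requiredDuringSchedulingIgnoredDuringExecution
      topologyKey:
        default: kubernetes.io/hostname   # forbid co-scheduling on one node
```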

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue has not been observed recently.

tags: removed: stx.retestneeded