"openstack server list" cmd hangs after standby controller reboot

Bug #1833730 reported by Peng Peng
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
In a DX (two-node) system, after a "sudo reboot -f" of the standby controller, the "openstack server list" command stops working.

Severity
--------
Major

Steps to Reproduce
------------------
1. On a two-node (AIO-DX) system, force a reboot of the standby controller with "sudo reboot -f".
2. On the active controller, run "openstack server list".

TC-name: test_evacuate.py::TestTisGuest::test_evacuate_vms

Expected Behavior
------------------
"openstack server list" cmd should keep working when rebooting standby controller

Actual Behavior
----------------
The command hangs; in the original run it did not return for 10 minutes.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 20190621T013000Z

Last Pass
---------
Lab: WCP_71_75
Load: 20190620T013000Z

Timestamp/Logs
--------------

[2019-06-21 12:55:53,533] 423 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-06-21 12:55:53,533] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 12:55:55,959] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 3c06d27a-f7e0-4184-bfed-cb4e4db6b15b | tenant1-image_vol-14 | ACTIVE | tenant2-mgmt-net=192.168.189.52; tenant2-net2=172.18.2.238 | tis-centos-guest | flv_nolocaldisk |
| 5235d268-6839-49a1-8b19-799bc00a0674 | tenant1-image_novol-13 | ACTIVE | tenant2-mgmt-net=192.168.189.56; tenant2-net2=172.18.2.135 | tis-centos-guest | flv_nolocaldisk |
| 9b9aa4a3-9eef-473d-be3f-836ee2eafa5b | tenant1-vol_local-12 | ACTIVE | tenant2-mgmt-net=192.168.189.39; tenant2-net2=172.18.2.151 | | flv_localdisk |
| 9efe809a-9ae4-4326-aba7-59d7674faa29 | tenant1-vol_nolocal-11 | ACTIVE | tenant2-mgmt-net=192.168.189.42; tenant2-net2=172.18.2.240 | | flv_nolocaldisk |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
controller-1:~$

[2019-06-21 12:58:01,981] 137 INFO MainThread host_helper.reboot_hosts:: Rebooting controller-0
[2019-06-21 12:58:01,981] 268 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-21 12:58:53,140] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:08:53,248] 358 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.

[2019-06-21 13:12:12,854] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-06-21 13:12:15,490] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-06-21 13:15:45,834] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:15:48,301] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------------+--------+------------------------------------------------------------+------------------+-----------------+
| 3c06d27a-f7e0-4184-bfed-cb4e4db6b15b | tenant1-image_vol-14 | ACTIVE | tenant2-mgmt-net=192.168.189.52; tenant2-net2=172.18.2.238 | tis-centos-guest | flv_nolocaldisk |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

The attached tar file is only for controller-0. Please provide the controller-1 logs as well.

Revision history for this message
Peng Peng (ppeng) wrote :

The controller-1 logs were not collected.

The automation log shows that the "server list" command hung for 10 minutes.

[2019-06-21 12:58:53,140] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-21 13:08:53,248] 358 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.
[2019-06-21 13:08:53,249] 691 DEBUG MainThread ssh.send_control:: Sending ctrl+c
[2019-06-21 13:08:53,336] 387 DEBUG MainThread ssh.expect :: Output:
^CTraceback (most recent call last):
  File "/usr/bin/openstack", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 210, in main
    return OpenStackShell().run(argv)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 116, in run
    column_names, data = self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/openstackclient/compute/v2/server.py", line 1275, in take_action
    limit=parsed_args.limit)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 892, in list
    "servers")
  File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 254, in _list
    resp, body = self.api.client.get(url)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 375, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 72, in request
    **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 534, in request
    resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 237, in request
    return self.session.request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 835, in request
    resp = send(**kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 926, in _send_request
    resp = self.session.request(method, url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/pyth...


Revision history for this message
Frank Miller (sensfan22) wrote :

The suspicion is that this issue results from only one nova-api-proxy pod running in the system; it was likely running on the standby controller when it was rebooted. Logs from controller-1 would be required to confirm.

Peng, please run a test where nova-api-proxy is running on the standby controller, reboot the standby controller, and determine how long it takes before the "openstack server list" command starts working again. This will tell us how long it takes for the nova-api-proxy pod to be failed over and come up on the active controller.

summary: - "openstack server list" cmd not working after standby controller reboot
+ "openstack server list" cmd hangs after standby controller reboot
Revision history for this message
Peng Peng (ppeng) wrote :

Test pre-condition:
- ensure nova-api-proxy is running on the standby controller;
- verify that the "openstack server list" command executes properly;
- reboot the standby controller.

After the standby controller reboot, the "openstack server list" command was run every minute in multiple terminals.
Test result:
0 min: command returned: Internal Server Error (HTTP 500)
1 min: command hung
2 min: still hung
3 min: still hung
3.5 min: returned "Connection timed out"

The same scenario occurred in a different session: the command hung for about 2.5 minutes, then returned a timeout.

After 6-7 minutes the standby controller booted up, but the command still hung.

Around 12 minutes after the reboot, command execution was back to normal.
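Peng's procedure above (rerun the CLI until it recovers and note the elapsed time) can be sketched as a small polling helper. This is an illustrative sketch, not part of the original automation; the function name, timings, and return convention are assumptions:

```python
import subprocess
import time

def wait_for_recovery(cmd, timeout=900, interval=60, cmd_timeout=240):
    """Poll `cmd` until it exits 0; return the elapsed outage in seconds.

    A hung command (no exit within `cmd_timeout`) is treated the same as
    a failure. Returns None if `cmd` never recovers within `timeout`.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            result = subprocess.run(cmd, capture_output=True,
                                    timeout=cmd_timeout)
            if result.returncode == 0:
                return time.monotonic() - start
        except subprocess.TimeoutExpired:
            pass  # command hung; subprocess kills it, keep polling
        time.sleep(interval)
    return None

# Hypothetical usage: measure how long the CLI is down after the reboot.
# outage = wait_for_recovery(["openstack", "server", "list"])
```

In the run described above, such a measurement would have reported an outage of roughly 12 minutes.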

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was also seen in a regression test run on load 20190622T013000Z: 3 test cases failed due to this issue ("openstack hypervisor list").
Details: CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list' failed to execute. Output: ^CTraceback (most recent call last):
 E File "/usr/bin/openstack", line 10, in <module>

 raise exceptions.CLIRejected("CLI '{}' failed to execute. Output: {}".format(complete_cmd, cmd_output))
 E utils.exceptions.CLIRejected: CLI command is rejected.
 E Details: CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne hypervisor list' failed to execute. Output: ^CTraceback (most recent call last):
 E File "/usr/bin/openstack", line 10, in <module>
 E sys.exit(main())
 E File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 210, in main
 E return OpenStackShell().run(argv)
 E File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
 E ret_val = super(OpenStackShell, self).run(argv)
 E File "/usr/lib/python2.7/site-packages/cliff/app.py", line 281, in run
 E result = self.run_subcommand(remainder)
 E File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 175, in run_subcommand
 E ret_value = super(OpenStackShell, self).run_subcommand(argv)
 E File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
 E result = cmd.run(parsed_args)
 E File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
 E return super(Command, self).run(parsed_args)
 E File "/usr/lib/python2.7/site-packages/cliff/display.py", line 116, in run
 E column_names, data = self.take_action(parsed_args)
 E File "/usr/lib/python2.7/site-packages/openstackclient/compute/v2/hypervisor.py", line 60, in take_action
 E data = compute_client.hypervisors.list()
 E File "/usr/lib/python2.7/site-packages/novaclient/api_versions.py", line 393, in substitution
 E return methods[-1].func(obj, *args, **kwargs)
 E File "/usr/lib/python2.7/site-packages/novaclient/v2/hypervisors.py", line 59, in list
 E return self._list_base(detailed=detailed)
 E File "/usr/lib/python2.7/site-packages/novaclient/v2/hypervisors.py", line 50, in _list_base
 E return self._list(path, 'hypervisors')
 E File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 254, in _list
 E resp, body = self.api.client.get(url)
 E File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 375, in get
 E return self.request(url, 'GET', **kwargs)
 E Fi...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating -- limitation in the current implementation where nova-api-proxy is only running on one controller. This results in delayed recovery (up to 12mins) if the controller w/ nova-api-proxy goes down.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Revision history for this message
Chris Friesen (cbf123) wrote :

The 12 minutes is odd, even given that nova-api-proxy is on the standby. I would not expect it to take that long to recover.

Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the logs, controller-0 came up at 2019-06-21T12:59:18. We see mariadb coming up initially at 13:07:41 but it fails to join the DB cluster and aborts at 13:07:51. We see another attempt at 13:08:44 and it is synced with the DB cluster at 13:09:08.

This doesn't explain why the other mariadb server isn't handling the requests though, and we need the logs from the other controller node to answer that.

Revision history for this message
Tao Liu (tliu88) wrote :

I suggest a quick test to evaluate the active-active nova-api-proxy deployment prior to making the override changes.

After the stx-openstack application is applied:
1. kubectl scale deployment nova-api-proxy -n openstack --replicas=2
2. Tail logs for both pods
kubectl logs -f -n openstack <nova-api-proxy-pod1>
kubectl logs -f -n openstack <nova-api-proxy-pod2>
3. Create a VM and perform some VM stop/start/pause actions, and ensure each request is processed by one pod

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Cindy, can you please help by assigning this to a member of your team? Thanks.

Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Cindy Xie (xxie1)
Revision history for this message
Dariush Eslimi (deslimi) wrote :

Bart will take over, as it seems it is more than just implementing active-active proxy.

Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

I have done some testing in the WP_1-2 lab where the issue was originally raised, with a load from July 16th. I am not able to reproduce the 12 minute "openstack server list" outage time. This is likely due to several fixes that have gone in since this LP was raised, including a fix to reduce the amount of time before pods are evicted when a node becomes unavailable.

However, I do see that the "openstack server list" command still fails for over a minute when the standby controller is rebooted and the nova-api-proxy pod is running on that controller. I did some testing with 2 replicas of nova-api-proxy (as per Tao's suggestion above) and this improves the time significantly: the "openstack server list" command now fails for about 30 seconds when the standby controller is rebooted. Note that there will always be some failure time when a standby AIO-DX controller is rebooted, because MariaDB goes down on any AIO-DX controller reboot, due to the way we handle redundancy for MariaDB.

I am going to use this LP to make the change to replicate the nova-api-proxy.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672544

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672544
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=6790b7e7c7b7742170554c5b219a02abbf6fcd16
Submitter: Zuul
Branch: master

commit 6790b7e7c7b7742170554c5b219a02abbf6fcd16
Author: Bart Wensley <email address hidden>
Date: Fri Jul 19 14:14:28 2019 -0500

    Replicate nova-api-proxy pod

    The nova-api-proxy is currently running in a single pod on one of
    the controllers. To improve recovery time when a controller
    fails, the nova-api-proxy pod will now be run with replicas
    set to two and anti-affinity configured so there is a pod on
    each controller.

    Closes-bug: 1833730
    Change-Id: Iacd17251b86050e337d9a0f832b9dfa6e9864fce
    Signed-off-by: Bart Wensley <email address hidden>
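The commit above runs nova-api-proxy with two replicas and pod anti-affinity. As a rough illustration only, such a change typically looks like the following Helm override sketch; the key names here are assumptions modeled on openstack-helm conventions, not the literal contents of the review:

```yaml
# Illustrative override sketch (key names are assumptions, not the
# literal starlingx/config change -- see the linked review for that).
pod:
  replicas:
    proxy: 2                # one nova-api-proxy pod per controller
  affinity:
    anti:
      type:
        default: requiredDuringSchedulingIgnoredDuringExecution
      topologyKey:
        default: kubernetes.io/hostname   # forbid co-scheduling on one node
```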

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue has not been observed recently.

tags: removed: stx.retestneeded