"openstack server list" cmd not working in 5 mins after active controller reboot

Bug #1829432 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen

Bug Description

Brief Description
-----------------
Reboot the active controller, then check the server list with the "openstack server list" command; the command does not work for roughly 5 minutes after the reboot (requests to nova-api-proxy time out).

Severity
--------
Major

Steps to Reproduce
------------------
Reboot the active controller ("sudo reboot -f"), then run "openstack server list" within the following 5 minutes.

TC-name:

Expected Behavior
------------------
"openstack server list" succeeds within an acceptable time after the active controller is rebooted.

Actual Behavior
----------------
For roughly 5 minutes after the reboot, the command fails with a connection timeout to nova-api-proxy before eventually recovering.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system
Multi-node system

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-15_18-01-07

Last Pass
---------
2019-05-07_14-47-37

Timestamp/Logs
--------------
[2019-05-16 17:09:54,981] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-05-16 17:09:56,945] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+--------------+--------+------------------------------------------------------------+-------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------+--------+------------------------------------------------------------+-------+----------------------+
| 99004abf-2071-44da-8054-fb56b1e8a06a | tenant2-vm-1 | ACTIVE | tenant2-mgmt-net=192.168.247.84; tenant2-net1=172.18.1.220 | | flavor-default-size2 |
+--------------------------------------+--------------+--------+------------------------------------------------------------+-------+----------------------+

[2019-05-16 17:19:06,438] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-16 17:25:12,026] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-05-16 17:27:39,637] 387 DEBUG MainThread ssh.expect :: Output:
Unable to establish connection to http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/69796a53984b426d8cb6932ef8d68fc3/servers/detail?all_tenants=True: HTTPConnectionPool(host='nova-api-proxy.openstack.svc.cluster.local', port=8774): Max retries exceeded with url: /v2.1/69796a53984b426d8cb6932ef8d68fc3/servers/detail?all_tenants=True (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4dbb4403d0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
controller-1:~$

[2019-05-16 17:31:28,757] 262 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-05-16 17:31:30,579] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------+--------+------------------------------------------------------------+------------------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------------+--------+------------------------------------------------------------+------------------+----------------------+
| 8dbec0a9-af0a-4f6a-8a89-04f69dcc2664 | tenant2-swact_img-2 | ACTIVE | tenant2-mgmt-net=192.168.247.77; tenant2-net1=172.18.1.221 | tis-centos-guest | flavor-default-size2 |
| 99004abf-2071-44da-8054-fb56b1e8a06a | tenant2-vm-1 | ACTIVE | tenant2-mgmt-net=192.168.247.84; tenant2-net1=172.18.1.220 | | flavor-default-size2 |
+--------------------------------------+---------------------+--------+------------------------------------------------------------+------------------+----------------------+

Test Activity
-------------
Sanity

Revision history for this message
Al Bailey (albailey1974) wrote :

The mariadb pods were still coming up during that 5-minute window

2019-05-16T17:30:04.666 controller-0 kubelet[92421]: info E0516 17:30:04.666795 92421 pod_workers.go:190] Error syncing pod 7ad4aaea-77f6-11e9-bfac-3cfdfe9f7330 ("mariadb-server-1_openstack(7ad4aaea-77f6-11e9-bfac-3cfdfe9f7330)"), skipping: failed to "StartContainer" for "mariadb" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=mariadb pod=mariadb-server-1_openstack(7ad4aaea-77f6-11e9-bfac-3cfdfe9f7330)"

That is the last mariadb restart failure log in /var/log/daemon.log.
I believe mariadb started working after this, which would explain why the calls then started to work.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating for now until further investigation. Not very urgent given that the cmd works after a period of time.

The investigation needs to focus on whether the mariadb recovery time is acceptable.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.sanity
tags: added: stx.2.0
tags: added: stx.containers
Changed in starlingx:
assignee: nobody → Chris Friesen (cbf123)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
Revision history for this message
Chris Friesen (cbf123) wrote :

The comment above about mariadb pods coming up during the 5-min window is a bit misleading, as only one of the mariadb pods went down. The other mariadb pod and the garbd pod stayed up and should have preserved quorum.

The timeline looks something like this:

2019-05-16 17:19:06 test script force-reboots the active controller (controller-0), where nova-api-proxy is running. nova-api-proxy has 300-second tolerations, so we're not going to respond to API requests for at least 5 minutes.

17:19:09 mariadb-server-0 (confusingly running on controller-1) detects loss of connectivity to mariadb-server-1. However, garbd is still in the cluster so we should preserve quorum.

17:20:42 nova-api-osapi on controller-1 processing API requests

17:24:30 nova-api-proxy on controller-0 starts up again

17:27:08 mariadb-server-0 on controller-1 sees mariadb-server-1 try to rejoin cluster

17:28:42 mariadb-server-1 on controller-0 tries to rejoin cluster, fails at 17:28:46

17:28:43 nova-api-osapi on controller-1 processing API requests

17:28:43 mariadb-server-0 on controller-1 sees a bunch of "Aborted connection to db" logs for different processes.

17:28:45 nova-api-osapi on controller-1 sees sql connection get reset

17:30:18 mariadb-server-1 on controller-0 tries to rejoin DB cluster again

17:30:37 mariadb-server-1 on controller-0 synced with DB cluster

17:30:54 nova-api-proxy on controller-0 first nova API call after startup, response a second later

17:31:26 mariadb-server-0 on controller-1 sees a bunch of "Aborted connection to db" logs for different processes.

17:31:28 nova-api-osapi on controller-1 sees sql connection get reset

17:31:28 nova-api-osapi on controller-1 processing API requests

Revision history for this message
Chris Friesen (cbf123) wrote :

It's worth highlighting that there is only one nova-api-proxy pod which will only run on controller nodes, and it has the following default tolerations:

Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

So when a controller node is force-rebooted, kubernetes will wait 5 minutes before even thinking about respawning the nova-api-proxy pod. This means that external calls to the openstack compute API will fail for that period.

It might be worth adjusting these tolerations to a smaller duration, or looking at running one pod per node. (This might require careful design to avoid races.)
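
For illustration only, a shorter toleration in the nova-api-proxy pod spec might look roughly like this (a sketch; the actual pod template and chart values for nova-api-proxy may differ):

     # hypothetical override, reducing tolerationSeconds from the 300s default
     tolerations:
     - key: "node.kubernetes.io/not-ready"
       operator: "Exists"
       effect: "NoExecute"
       tolerationSeconds: 30
     - key: "node.kubernetes.io/unreachable"
       operator: "Exists"
       effect: "NoExecute"
       tolerationSeconds: 30

With tolerationSeconds at 30, kubernetes would start evicting the pod about 30 seconds after the node goes not-ready/unreachable instead of waiting 5 minutes.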

It's also not clear to me why there was such a large delay between nova-api-proxy starting up at 17:24:30 and it processing the first API request at 17:30:54. It kind of looks like it might be due to delays doing keystone authentication:

{"log":"2019-05-16 17:24:30,793.793 8 INFO keystonemiddleware.auth_token [-] Starting Keystone auth_token middleware\n","stream":"stdout","time":"2019-05-16T17:24:30.794211517Z"}
{"log":"2019-05-16 17:24:30,797.797 8 WARNING keystonemiddleware.auth_token [-] AuthToken middleware is set with keystone_authtoken.service_token_roles_required set to False. This is backwards compatible but deprecated behaviour. Please set this
to True.\n","stream":"stdout","time":"2019-05-16T17:24:30.797502535Z"}
{"log":"No handlers could be found for logger \"oslo_config.cfg\"\n","stream":"stderr","time":"2019-05-16T17:24:30.822828006Z"}
{"log":"2019-05-16 17:24:30,823.823 8 INFO nova_api_proxy.common.service [-] osapi_proxy listening on 0.0.0.0:8774\n","stream":"stdout","time":"2019-05-16T17:24:30.824246725Z"}
{"log":"2019-05-16 17:30:53,700.700 8 WARNING keystonemiddleware.auth_token [-] Using the in-process token cache is deprecated as of the 4.2.0 release and may be removed in the 5.0.0 release or the 'O' development cycle. The in-process cache caus
es inconsistent results and high memory usage. When the feature is removed the auth_token middleware will not cache tokens by default which may result in performance issues. It is recommended to use memcache for the auth_token token cache by set
ting the memcached_servers option.\n","stream":"stdout","time":"2019-05-16T17:30:53.700504014Z"}

The keystone-api pod on controller-0 starts up at 17:28:56 but doesn't really seem to be happy until 17:29:32.

I see the following logs in the keystone-api pod on controller-1, which seem to line up well with the nova-api-proxy log above:

{"log":"172.16.1.8 - - [16/May/2019:17:30:54 +0000] \"GET /v3 HTTP/1.1\" 200 280 \"-\" \"nova-api-os-compute keystoneauth1/3.13.1 python-requests/2.21.0 CPython/2.7.5\"\n","stream":"stdout","time":"2019-05-16T17:30:54.331232862Z"}
{"log":"172.16.1.8 - - [16/May/2019:17:30:54 +0000] \"POST /v3/auth/tokens HTTP/1.1\" 201 10158 \"-\" \"nova-api-os-compute keystoneauth1/3.13.1 python-requests/2.21.0 CPython/2.7.5\"\n","stream":"stdout","time":"2019-05-16T17:30:54.613528716Z"}

Revision history for this message
Chris Friesen (cbf123) wrote :

There has been a recent update to the VIM (https://opendev.org/starlingx/nfv/commit/cdd6c334d9d1d6c0f4de344fff8ef2af28c76e56) which will help address the issue with nova-api-proxy. As of that commit, if the node where the proxy is running goes down, the pod should get evicted quite quickly rather than waiting for 5 minutes. This should address the original problem in this bug.

Chris Friesen (cbf123)
Changed in starlingx:
status: Triaged → Fix Committed
Chris Friesen (cbf123)
Changed in starlingx:
status: Fix Committed → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_2
Load: 2019-06-18_20-18-00

13 mins after reboot, "server list" still not working.

collect log attached

[2019-06-19 21:04:21,116] 268 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-19 21:17:55,508] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2019-06-19 21:17:56,972] 387 DEBUG MainThread ssh.expect :: Output:
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)
controller-0:~$

Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
Peng Peng (ppeng) wrote :

Also saw issue reproduced on
Lab: WCP_71_75
Load: 20190620T013000Z

[sysadmin@controller-1 ~(keystone_admin)]$
[2019-06-20 17:58:14,716] 150 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-1
[2019-06-20 17:58:14,716] 268 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-20 18:01:20,934] 268 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list'
[2019-06-20 18:01:31,045] 358 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667738

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Chris Friesen (cbf123) wrote :

Looking at the kernel logs for controller-1 in the WCP_71_76 tarball in comment 10 above, I see a log emitted at 2019-06-20T17:57:52 but then nothing until kernel boot logs at 2019-06-20T18:21:25.

Not sure why it took so long to boot, but it's a separate issue from the problem this bug is trying to deal with.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/667738
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=f5cdd21bc5db71c75681d7f3883ce18704423495
Submitter: Zuul
Branch: master

commit f5cdd21bc5db71c75681d7f3883ce18704423495
Author: Chris Friesen <email address hidden>
Date: Wed Jun 26 15:38:52 2019 -0600

    Change default pod tolerations to 30sec

    By default, pods will get a toleration of 300 seconds for the
    notReady:NoExecute and unreachable:NoExecute taints. This is
    excessively long for our purposes.

    This commit changes the commandline args for kube-apiserver to use
    30 seconds for each of these instead to get a faster response time.

    Change-Id: Ic7153e0ea03a1835efcd907cbcef8c8715a0539d
    Partial-Bug: 1829432
    Signed-off-by: Chris Friesen <email address hidden>
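
For reference, the defaults being overridden here are kube-apiserver flags; the change amounts to passing arguments along these lines in the kube-apiserver manifest (a sketch of the flags involved, not the exact StarlingX config change):

    # excerpt of a kube-apiserver static pod manifest (illustrative sketch)
    spec:
      containers:
      - name: kube-apiserver
        command:
        - kube-apiserver
        - --default-not-ready-toleration-seconds=30
        - --default-unreachable-toleration-seconds=30

These flags set the tolerationSeconds that the DefaultTolerationSeconds admission plugin injects into pods that do not specify their own not-ready/unreachable tolerations.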

Frank Miller (sensfan22)
Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Frank Miller (sensfan22) wrote :

Changing status to Fix Released. After the partial commit by Chris as well as numerous other changes over the past 2 months, the time required for openstack commands to work after an active controller reboot was measured on a standard system and on a storage system. Times were well under 5 minutes:

With 20190720T013000Z load on storage system: 2 mins 6 secs
With 20190725T013000Z load on standard system: 1 min 29 secs

Based on the above, marking this LP as fix released. (Note: a separate LP 1838411 is open to reduce these times further)

Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced recently.

tags: removed: stx.retestneeded