Openstack commands hold prompt > 30 seconds

Bug #1837686 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Tao Liu

Bug Description

Brief Description
-----------------
During sanity execution, several openstack commands are used (openstack server, openstack image, openstack volume, etc.). Randomly, some of the commands take several seconds to complete and return the shell prompt. This breaks automated sanity execution when the delay exceeds 30 seconds.

Severity
--------
Major: on automated testing, this breaks executions. An automated or manual re-execution is a good workaround for this issue.

Steps to Reproduce
------------------
After an environment is fully provisioned, this can be reproduced with any openstack command. Commands usually take 1-2 seconds to return the prompt, but from time to time a command can take >30 seconds to complete and return the prompt, which breaks automation. This happens with different configurations and different openstack options. It is not tied to the actual action (e.g., creating a VM, which can legitimately take several minutes); the delay in returning the prompt is what breaks the automation.
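
A minimal reproduction sketch, assuming the openstack_helm cloud entry used elsewhere in this report; it times an inexpensive command in a loop and flags any iteration that exceeds the 30-second budget:

#!/bin/bash
# Hypothetical reproduction loop: run a cheap openstack command repeatedly
# and report any iteration that takes longer than 30 seconds to return the prompt.
export OS_CLOUD=openstack_helm
for i in $(seq 1 100); do
    start=$(date +%s)
    openstack server list > /dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt 30 ]; then
        echo "$(date -u +%FT%TZ) iteration $i: openstack server list took ${elapsed}s"
    fi
done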

Expected Behavior
------------------
After executing an openstack command, the prompt should be available within 1-2 seconds.

Actual Behavior
----------------
After executing an openstack command, the prompt sometimes returns only after >30 seconds.

Reproducibility
---------------
Intermittent. It happens randomly, but the frequency has increased in the last couple of weeks.

System Configuration
--------------------
Baremetal and virtual environments. Simplex, Duplex, Standard, Standard-External.

Branch/Pull Time/Commit
-----------------------
Using the latest ISO from 20190723T141142Z. This issue has been recurrent since July 4. Across all configurations and environments, we execute around 400 openstack test cases daily; this issue breaks on average 10% of them. As mentioned before, we re-execute the failing test cases (manually or automatically) and they pass.

Last Pass
---------
As a reference, I checked our old logs; this issue has been recurring since the first week of July, but it seems to be increasing in frequency. Before that date, it happened only a handful of times.

Timestamp/Logs
--------------
Here are some examples of the commands that are failing:
http://paste.openstack.org/show/754785/
I'm not sure if a collect will actually help to debug; let me know and I'll provide it. With the next build, I'll use the healthiest configuration to run some commands several times and capture the time it takes to return the prompt.
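
One way to capture those timings so they can later be matched against collect logs, as a sketch (the log file name is arbitrary):

# Hypothetical timing capture: record a UTC timestamp and the wall-clock time of
# each command so entries can be correlated with collect log timestamps later.
export OS_CLOUD=openstack_helm
{
    date -u +"%Y-%m-%d %H:%M:%S UTC"
    time openstack image list > /dev/null
} 2>&1 | tee -a openstack_cmd_timing.log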

Test Activity
-------------
Sanity.

tags: added: stx.sanity
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Cristopher, can you provide the collect logs from the standard system when the openstack show cmd takes more than 30 seconds? Please note the timestamp for the cmd in the notes.

summary: - Openstack commands hold prompt for several seconds
+ Openstack commands hold prompt > 30 seconds
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Ghada,

I uploaded the collect from a Standard External Storage (2+2+2) system where we got the failures. I provided some additional details here: http://paste.openstack.org/show/754819/ . One of the iterations is at 9:58 (there could be more than one failure).

When I was manually executing some commands on this same hardware, I got a single failure which might point to the actual root cause:

controller-0:~$ export OS_CLOUD=openstack_helm
controller-0:~$ openstack server list
Unable to establish connection to http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/detail: HTTPConnectionPool(host='nova-api-proxy.openstack.svc.cluster.local', port=8774): Max retries exceeded with url: /v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/detail (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb094274410>: Failed to establish a new connection: [Errno 110] Connection timed out',))

However, I only got this error once. As mentioned, the error is random. I'll leave some loop running to try to reproduce it.
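
Given that the one observed failure was a connect timeout to nova-api-proxy, a simple connectivity probe against that endpoint (a sketch, assuming curl is available on the controller) could help distinguish slow API responses from connection/DNS problems:

# Hypothetical probe of the nova-api-proxy endpoint from the error above.
# Prints the HTTP status code and total request time, giving up after 5 seconds.
while true; do
    curl -m 5 -s -o /dev/null \
         -w "$(date -u +%FT%TZ) http_code=%{http_code} time_total=%{time_total}s\n" \
         http://nova-api-proxy.openstack.svc.cluster.local:8774/ \
      || echo "$(date -u +%FT%TZ) connection attempt failed or timed out"
    sleep 5
done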

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per input from Frank Miller, marking as stx.2.0 medium priority in order to investigate why the openstack cmds take a long time sometimes.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: nobody → Tao Liu (tliu88)
Tao Liu (tliu88)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Tao Liu (tliu88) wrote :

Hi Cristopher,

I have not seen anything in the logs that could have led to the unpause of vm-centos-1 timing out after 30 seconds. Have you seen such a timeout issue in recent loads?

Below are the event sequences discovered from the logs, using the unpause of vm-cirros-1 and the 9:58 timestamp as references.

1. User request to pause vm-centos-1, request processed within 1 second
{"log":"2019-07-24 09:57:52,240.240 6 INFO nova_api_proxy.apps.acceptor [-] POST request issued by user (admin) tenant (admin) remote address (192.168.200.84) \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/a84fe307-c524-407b-9e82-aa977be5c9e9/action\"\n","stream":"stdout","time":"2019-07-24T09:57:52.240532395Z"}
{"log":"2019-07-24 09:57:52,240.240 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/a84fe307-c524-407b-9e82-aa977be5c9e9/action\", action: (pause), val:(None)\n","stream":"stdout","time":"2019-07-24T09:57:52.240553954Z"}
{"log":"2019-07-24 09:57:52,362.362 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['']\n","stream":"stdout","time":"2019-07-24T09:57:52.362365759Z"}

2. Event generated for pausing vm-centos-1
2019-07-24T09:57:52.000 controller-0 fmManager: info { "event_log_id" : "700.115", "reason_text" : "Pause issued by admin against instance vm-centos-1 owned by admin on host compute-0", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=a84fe307-c524-407b-9e82-aa977be5c9e9", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:52.246692" }
2019-07-24T09:57:52.000 controller-0 fmManager: info { "event_log_id" : "700.120", "reason_text" : "Pause complete for instance vm-centos-1 now paused on host compute-0", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=a84fe307-c524-407b-9e82-aa977be5c9e9", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:52.482061" }

3. User request to unpause vm-cirros-1
{"log":"2019-07-24 09:57:55,755.755 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/43126d71-75b3-48d1-93c6-44bc66c7b5f1/action\", action: (unpause), val:(None)\n","stream":"stdout","time":"2019-07-24T09:57:55.755677707Z"}
{"log":"2019-07-24 09:57:55,873.873 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['']\n","stream":"stdout","time":"2019-07-24T09:57:55.87341393Z"}

4. Event generated for unpausing vm-cirros-1
2019-07-24T09:57:55.000 controller-0 fmManager: info { "event_log_id" : "700.121", "reason_text" : "Unpause issued by admin against instance vm-cirros-1 owned by admin on host compute-1", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=43126d71-75b3-48d1-93c6-44bc66c7b5f1", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:55.761490" }
2019-07-24T09:57:56.000 controller-0 fmManager: info { "event_log_id" : "700.126", "reason_text" : "Unpause complete for instance vm-cirros-1 now enabled on host compute-1", "entity_instan...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hey Tao, thanks for looking into this. We are still hitting this timeout issue randomly; we might skip a day or two, but just yesterday we faced it again.

Please, take a look at this paste: http://paste.openstack.org/show/755635/

I pasted the last occurrences for each config: Simplex, Duplex, Standard, and Standard dedicated storage. The following is the list of the last two commands that failed for each config:

simplex
openstack server show vm-cirros-2
openstack server rebuild vm-cirros-1

duplex
openstack server show vm-cirros-1
openstack network create --project 72a9e5fb8f564f968efbdbc11a3f515a --provider-network-type=vlan --provider-physical-network=physnet0 --provider-segment=10 --share --external external-net0

standard
openstack server pause vm-cirros-2
openstack compute service list --service nova-compute

standard dedicated storage
openstack server show vm-cirros-1
openstack compute service list --service nova-compute

As you might have noticed, this is not tied to a config (it happens across all configs), a service (the errors I pasted land on compute, server, and network), or an action (it happens even on a simple show).

Could this be related to a service required for every action (e.g., keystone)? How can I help narrow it down? Should I upload a 'collect' for each failure? I'd rather avoid that because of the huge size of the collects, but I can do it if required.
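
One low-cost check along those lines, as a sketch (the pod name patterns are assumptions based on a typical stx-openstack deployment), would be to look for API pod restarts or failing probes around the failure window:

# Hypothetical health check of the containerized OpenStack services on StarlingX.
# Look for non-Running pods or restart counts that bump around the time a command hangs.
kubectl -n openstack get pods -o wide | grep -E 'keystone|nova-api|cinder-api|neutron-server'

# Recent events in the namespace can also reveal probe failures or pod reschedules.
kubectl -n openstack get events --sort-by=.lastTimestamp | tail -n 20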

Revision history for this message
Tao Liu (tliu88) wrote :

Hi Cristopher,

Can you provide some details on the automated test suite? Does the test execution issue multiple concurrent commands?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Tao,

Basically, the automated suite executes a single command at a time, one after another. I'm going to attach the debug.log file from the sanity suite. There, you will notice the following:

- On line 19808, the suite does an "openstack server resize".
- Afterwards, the suite checks for the RESIZE/VERIFY_RESIZE status using "Executing command 'export OS_CLOUD=openstack_helm && openstack server show vm-cirros-1|grep -w status|tail -1|awk '{print$4}''."
- On lines 20552/20553, the system fails to respond in time and the error "SSHClientException: Timed out in 30 seconds" is raised.

This is just an example; as stated, a different "openstack COMMAND ACTION" might fail.
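
Outside the Robot suite, the same 30-second budget can be approximated from a shell, as a sketch using coreutils timeout (the VM name is taken from the example above, and the CLI's -f value -c status output selection stands in for the suite's grep/awk pipeline):

# Hypothetical stand-alone check mirroring the suite's pattern: one command,
# a 30-second budget, and a report on whether the prompt came back in time.
if timeout 30 bash -c 'export OS_CLOUD=openstack_helm && \
        openstack server show vm-cirros-1 -f value -c status'; then
    echo "command returned within 30 seconds"
else
    echo "command exceeded the 30 second budget or failed (exit code $?)"
fi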

Revision history for this message
Tao Liu (tliu88) wrote :

Thank you, Cristopher, for the info! Can you attach the collect logs whose timestamps match the debug.log?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

The last iteration of this error was found during a Duplex Baremetal Sanity, during the command "openstack image delete centos".

I attached both the full collect and the suite debug.log.

You will find the error in debug.log, at line 5595 and below.

I just noticed that there is a drift of 13 minutes between the server and the workstation where the suite executes.

jenkins@registry2:~$ ssh 192.168.200.76 "date"
sysadmin@192.168.200.76's password:
vie ago 16 12:04:28 UTC 2019
jenkins@registry2:~$ date
vie ago 16 12:17:44 CDT 2019

The error in debug.log is at 20190816 09:13:32.545 (workstation time); the corresponding server time will be around 09:00.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Attaching collect.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Tao,

I modified the test suite code to extend the wait for the shell from 30 seconds to 3 minutes. Now I got this error with ISO `BUILD_DATE="2019-08-25 23:30:00 +0000"`:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20190827 09:15:00.486 - INFO - +---- START KW: OpenStack.Create Volume [ ${cirros_volume_size} | ${cirros_image_name} | --bootable | ${cirros_volume_name_2} ]
20190827 09:15:00.486 - INFO - +----- START KW: BuiltIn.Set Variable [ openstack volume create ]
20190827 09:15:00.486 - INFO - ${openstack_cmd} = openstack volume create
20190827 09:15:00.486 - INFO - +----- END KW: BuiltIn.Set Variable (0)
20190827 09:15:00.487 - INFO - +----- START KW: BuiltIn.Catenate [ ${openstack_cmd} | --size ${size} | --image ${image} | ${bootable} | ${name} ]
20190827 09:15:00.487 - INFO - ${cmd} = openstack volume create --size 20 --image cirros --bootable vol-cirros-2
20190827 09:15:00.487 - INFO - +----- END KW: BuiltIn.Catenate (1)
20190827 09:15:00.487 - INFO - +----- START KW: OpenStack.Run OS Command [ ${cmd} | True | 3 min ]
20190827 09:15:00.487 - INFO - +------ START KW: BuiltIn.Set Variable [ export OS_CLOUD=openstack_helm ]
20190827 09:15:00.487 - INFO - ${load_os_token} = export OS_CLOUD=openstack_helm
20190827 09:15:00.487 - INFO - +------ END KW: BuiltIn.Set Variable (0)
20190827 09:15:00.488 - INFO - +------ START KW: SSHLibrary.Execute Command [ ${load_os_token} && ${cmd} | return_stdout=True | return_stderr=True | return_rc=True | timeout=${timeout} ]
20190827 09:15:00.488 - INFO - Executing command 'export OS_CLOUD=openstack_helm && openstack volume create --size 20 --image cirros --bootable vol-cirros-2'.
20190827 09:17:09.079 - INFO - Command exited with return code 1.
20190827 09:17:09.079 - INFO - ${stdout} =
20190827 09:17:09.079 - INFO - ${stderr} = Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port...
20190827 09:17:09.079 - INFO - ${rc} = 1
20190827 09:17:09.080 - INFO - +------ END KW: SSHLibrary.Execute Command (128592)
20190827 09:17:09.080 - INFO - +------ START KW: BuiltIn.Create Dictionary [ stdout=${stdout} | stderr=${stderr} | rc=${rc} ]
20190827 09:17:09.082 - INFO - ${res} = {u'stdout': u'', u'stderr': u"Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.open...
20190827 09:17:09.082 - INFO - +------ END KW: BuiltIn.Create Dictionary (2)
20190827 09:17:09.082 - INFO - +------ START KW: BuiltIn.Run Keyword If [ ${rc} != 0 and ${fail_if_error} == True | FAIL | ${stderr} ]
20190827 09:17:09.084 - INFO - +------- START KW: BuiltIn.Fail [ ${stderr} ]
20190827 09:17:09.084 - FAIL - Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port=8776): Max retries exceeded with url: /v2/7a413f8546d246059fd3e95f2bf1977f/volumes (Caused by NewConnectionErro...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on regression testing done for Train, this issue is very rare. As per Yang Liu, it possibly happens < 1% of the time. As such, there is no plan to fix this issue at this time.

Changed in starlingx:
status: In Progress → Won't Fix