Openstack commands hold prompt > 30 seconds

Bug #1837686 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Tao Liu

Bug Description

Brief Description
-----------------
During sanity execution, several openstack commands are used (openstack server, openstack image, openstack volume, etc.). Randomly, some of the commands take several seconds to complete and return the shell prompt. This breaks automated sanity execution when the delay exceeds 30 seconds.

Severity
--------
Major: on automated testing, this breaks executions. An automated or manual re-execution is a good workaround for this issue.

Steps to Reproduce
------------------
After an environment is fully provisioned, this can be reproduced with any openstack command. Commands usually take 1-2 seconds to return the prompt, but from time to time a command can take >30 seconds to complete and return the prompt, which breaks automation. This happens with different configurations and different openstack options. It is not tied to the actual action (e.g., creating a VM, which can legitimately take several minutes); the delay in returning the prompt is what breaks the automation.
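
A minimal reproduction sketch, assuming the openstack_helm cloud entry used elsewhere in this report; it times an inexpensive command in a loop and flags any iteration that exceeds the 30-second budget:

#!/bin/bash
# Hypothetical reproduction loop: run a cheap openstack command repeatedly
# and report any iteration that takes longer than 30 seconds to return the prompt.
export OS_CLOUD=openstack_helm
for i in $(seq 1 100); do
    start=$(date +%s)
    openstack server list > /dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt 30 ]; then
        echo "$(date -u +%FT%TZ) iteration $i: openstack server list took ${elapsed}s"
    fi
done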

Expected Behavior
------------------
After executing an openstack command, the prompt should be available within 1-2 seconds.

Actual Behavior
----------------
After executing an openstack command, the prompt sometimes returns only after >30 seconds.

Reproducibility
---------------
Intermittent. It happens randomly, but the frequency has increased in the last couple of weeks.

System Configuration
--------------------
Baremetal and virtual environments. Simplex, Duplex, Standard, Standard-External.

Branch/Pull Time/Commit
-----------------------
Using the latest ISO from 20190723T141142Z. This issue has been recurrent since July 4. Across all configurations and environments, we execute around 400 openstack test cases daily; this issue breaks on average 10% of them. As mentioned before, we re-execute the failing test cases (manually or automatically) and they pass.

Last Pass
---------
As a reference, I checked our old logs; this issue has been recurring since the first week of July, but it seems to be increasing in frequency. Before that date, it happened only a handful of times.

Timestamp/Logs
--------------
Here are some examples of the commands that are failing:
http://paste.openstack.org/show/754785/
I'm not sure if a collect will actually help to debug; let me know and I'll provide it. With the next build, I'll use the healthiest configuration to run some commands several times and capture the time it takes to return the prompt.
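
One way to capture those timings so they can later be matched against collect logs, as a sketch (the log file name is arbitrary):

# Hypothetical timing capture: record a UTC timestamp and the wall-clock time of
# each command so entries can be correlated with collect log timestamps later.
export OS_CLOUD=openstack_helm
{
    date -u +"%Y-%m-%d %H:%M:%S UTC"
    time openstack image list > /dev/null
} 2>&1 | tee -a openstack_cmd_timing.log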

Test Activity
-------------
Sanity.

tags: added: stx.sanity
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Cristopher, can you provide the collect logs from the standard system when the openstack show cmd takes more than 30 seconds? Please note the timestamp for the cmd in the notes.

summary: - Openstack commands hold prompt for several seconds
+ Openstack commands hold prompt > 30 seconds
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Ghada,

I uploaded the collect from a Standard External Storage (2+2+2) system where we got the failures. I provided some additional details here: http://paste.openstack.org/show/754819/ . One of the iterations is at 9:58 (there could be more than one failure).

When I was manually executing some commands on this same hardware, I got a single failure which might point to the actual root cause:

controller-0:~$ export OS_CLOUD=openstack_helm
controller-0:~$ openstack server list
Unable to establish connection to http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/detail: HTTPConnectionPool(host='nova-api-proxy.openstack.svc.cluster.local', port=8774): Max retries exceeded with url: /v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/detail (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb094274410>: Failed to establish a new connection: [Errno 110] Connection timed out',))

However, I only got this error once. As mentioned, the error is random. I'll leave some loop running to try to reproduce it.
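
Given that the one observed failure was a connect timeout to nova-api-proxy, a simple connectivity probe against that endpoint (a sketch, assuming curl is available on the controller) could help distinguish slow API responses from connection/DNS problems:

# Hypothetical probe of the nova-api-proxy endpoint from the error above.
# Prints the HTTP status code and total request time, giving up after 5 seconds.
while true; do
    curl -m 5 -s -o /dev/null \
         -w "$(date -u +%FT%TZ) http_code=%{http_code} time_total=%{time_total}s\n" \
         http://nova-api-proxy.openstack.svc.cluster.local:8774/ \
      || echo "$(date -u +%FT%TZ) connection attempt failed or timed out"
    sleep 5
done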

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per input from Frank Miller, marking as stx.2.0 medium priority in order to investigate why the openstack cmds take a long time sometimes.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: nobody → Tao Liu (tliu88)
Tao Liu (tliu88)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Tao Liu (tliu88) wrote :

Hi Cristopher,

I have not seen anything in the logs that could have led to the unpause of vm-centos-1 timing out after 30 seconds. Have you seen such a timeout issue in recent loads?

Below are the event sequences discovered from the logs, using the unpause of vm-cirros-1 and the 9:58 timestamp as references.

1. User request to pause vm-centos-1, request processed within 1 second
{"log":"2019-07-24 09:57:52,240.240 6 INFO nova_api_proxy.apps.acceptor [-] POST request issued by user (admin) tenant (admin) remote address (192.168.200.84) \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/a84fe307-c524-407b-9e82-aa977be5c9e9/action\"\n","stream":"stdout","time":"2019-07-24T09:57:52.240532395Z"}
{"log":"2019-07-24 09:57:52,240.240 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/a84fe307-c524-407b-9e82-aa977be5c9e9/action\", action: (pause), val:(None)\n","stream":"stdout","time":"2019-07-24T09:57:52.240553954Z"}
{"log":"2019-07-24 09:57:52,362.362 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['']\n","stream":"stdout","time":"2019-07-24T09:57:52.362365759Z"}

2. Event generated for pausing vm-centos-1
2019-07-24T09:57:52.000 controller-0 fmManager: info { "event_log_id" : "700.115", "reason_text" : "Pause issued by admin against instance vm-centos-1 owned by admin on host compute-0", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=a84fe307-c524-407b-9e82-aa977be5c9e9", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:52.246692" }
2019-07-24T09:57:52.000 controller-0 fmManager: info { "event_log_id" : "700.120", "reason_text" : "Pause complete for instance vm-centos-1 now paused on host compute-0", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=a84fe307-c524-407b-9e82-aa977be5c9e9", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:52.482061" }

3. User request to unpause vm-cirros-1
{"log":"2019-07-24 09:57:55,755.755 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV \"POST http://nova-api-proxy.openstack.svc.cluster.local:8774/v2.1/ae1dc259ba42487bad3b3eb814c94c07/servers/43126d71-75b3-48d1-93c6-44bc66c7b5f1/action\", action: (unpause), val:(None)\n","stream":"stdout","time":"2019-07-24T09:57:55.755677707Z"}
{"log":"2019-07-24 09:57:55,873.873 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['']\n","stream":"stdout","time":"2019-07-24T09:57:55.87341393Z"}

4. Event generated for unpausing vm-cirros-1
2019-07-24T09:57:55.000 controller-0 fmManager: info { "event_log_id" : "700.121", "reason_text" : "Unpause issued by admin against instance vm-cirros-1 owned by admin on host compute-1", "entity_instance_id" : "tenant=ae1dc259-ba42-487b-ad3b-3eb814c94c07.instance=43126d71-75b3-48d1-93c6-44bc66c7b5f1", "severity" : "critical", "state" : "msg", "timestamp" : "2019-07-24 09:57:55.761490" }
2019-07-24T09:57:56.000 controller-0 fmManager: info { "event_log_id" : "700.126", "reason_text" : "Unpause complete for instance vm-cirros-1 now enabled on host compute-1", "entity_instan...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hey Tao, thanks for looking into this. We are still hitting this timeout issue randomly; we might skip a day or two, but just yesterday we faced it again.

Please, take a look at this paste: http://paste.openstack.org/show/755635/

I pasted the last occurrences for each config: Simplex, Duplex, Standard, and Standard dedicated storage. The following is the list of the last two commands that failed for each config:

simplex
openstack server show vm-cirros-2
openstack server rebuild vm-cirros-1

duplex
openstack server show vm-cirros-1
openstack network create --project 72a9e5fb8f564f968efbdbc11a3f515a --provider-network-type=vlan --provider-physical-network=physnet0 --provider-segment=10 --share --external external-net0

standard
openstack server pause vm-cirros-2
openstack compute service list --service nova-compute

standard dedicated storage
openstack server show vm-cirros-1
openstack compute service list --service nova-compute

As you might have noticed, this is not tied to a config (it happens across all configs), a service (the errors I pasted land on compute, server, and network), or an action (it happens even on a simple show).

Could this be related to a service required for every action (e.g., keystone)? How can I help narrow it down? Should I upload a 'collect' for each failure? I'd rather avoid that because of the huge size of the collects, but I can do it if required.
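
One low-cost check along those lines, as a sketch (the pod name patterns are assumptions based on a typical stx-openstack deployment), would be to look for API pod restarts or failing probes around the failure window:

# Hypothetical health check of the containerized OpenStack services on StarlingX.
# Look for non-Running pods or restart counts that bump around the time a command hangs.
kubectl -n openstack get pods -o wide | grep -E 'keystone|nova-api|cinder-api|neutron-server'

# Recent events in the namespace can also reveal probe failures or pod reschedules.
kubectl -n openstack get events --sort-by=.lastTimestamp | tail -n 20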

Revision history for this message
Tao Liu (tliu88) wrote :

Hi Cristopher,

Can you provide some details on the automated test suite? Does the test execution issue multiple concurrent commands?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Tao,

Basically, the automated suite executes a single command at a time, one after another. I'm going to attach the debug.log file from the sanity suite. There, you will notice the following:

- On line 19808, the suite does an "openstack server resize".
- Afterwards, the suite checks for the RESIZE/VERIFY_RESIZE status using "Executing command 'export OS_CLOUD=openstack_helm && openstack server show vm-cirros-1|grep -w status|tail -1|awk '{print$4}''."
- On lines 20552/20553, the system fails to respond in time and the error "SSHClientException: Timed out in 30 seconds" is raised.

This is just an example; as stated, a different "openstack COMMAND ACTION" might fail.
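
Outside the Robot suite, the same 30-second budget can be approximated from a shell, as a sketch using coreutils timeout (the VM name is taken from the example above, and the CLI's -f value -c status output selection stands in for the suite's grep/awk pipeline):

# Hypothetical stand-alone check mirroring the suite's pattern: one command,
# a 30-second budget, and a report on whether the prompt came back in time.
if timeout 30 bash -c 'export OS_CLOUD=openstack_helm && \
        openstack server show vm-cirros-1 -f value -c status'; then
    echo "command returned within 30 seconds"
else
    echo "command exceeded the 30 second budget or failed (exit code $?)"
fi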

Revision history for this message
Tao Liu (tliu88) wrote :

Thank you, Cristopher, for the info! Can you attach the collect logs whose timestamps match the debug.log?

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

The last iteration of this error was found during a Duplex Baremetal Sanity, during the command "openstack image delete centos".

I attached both the full collect and the suite debug.log.

You will find the error in debug.log, at line 5595 and below.

I just noticed that there is a drift of 13 minutes between the server and the workstation where the suite executes.

jenkins@registry2:~$ ssh 192.168.200.76 "date"
sysadmin@192.168.200.76's password:
vie ago 16 12:04:28 UTC 2019
jenkins@registry2:~$ date
vie ago 16 12:17:44 CDT 2019

The error in debug.log is at 20190816 09:13:32.545 (workstation time); the corresponding server time will be around 09:00.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Attaching collect.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Hi Tao,

I modified the test suite code to extend the wait for the shell from 30 seconds to 3 minutes. Now I got this error with ISO `BUILD_DATE="2019-08-25 23:30:00 +0000"`:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20190827 09:15:00.486 - INFO - +---- START KW: OpenStack.Create Volume [ ${cirros_volume_size} | ${cirros_image_name} | --bootable | ${cirros_volume_name_2} ]
20190827 09:15:00.486 - INFO - +----- START KW: BuiltIn.Set Variable [ openstack volume create ]
20190827 09:15:00.486 - INFO - ${openstack_cmd} = openstack volume create
20190827 09:15:00.486 - INFO - +----- END KW: BuiltIn.Set Variable (0)
20190827 09:15:00.487 - INFO - +----- START KW: BuiltIn.Catenate [ ${openstack_cmd} | --size ${size} | --image ${image} | ${bootable} | ${name} ]
20190827 09:15:00.487 - INFO - ${cmd} = openstack volume create --size 20 --image cirros --bootable vol-cirros-2
20190827 09:15:00.487 - INFO - +----- END KW: BuiltIn.Catenate (1)
20190827 09:15:00.487 - INFO - +----- START KW: OpenStack.Run OS Command [ ${cmd} | True | 3 min ]
20190827 09:15:00.487 - INFO - +------ START KW: BuiltIn.Set Variable [ export OS_CLOUD=openstack_helm ]
20190827 09:15:00.487 - INFO - ${load_os_token} = export OS_CLOUD=openstack_helm
20190827 09:15:00.487 - INFO - +------ END KW: BuiltIn.Set Variable (0)
20190827 09:15:00.488 - INFO - +------ START KW: SSHLibrary.Execute Command [ ${load_os_token} && ${cmd} | return_stdout=True | return_stderr=True | return_rc=True | timeout=${timeout} ]
20190827 09:15:00.488 - INFO - Executing command 'export OS_CLOUD=openstack_helm && openstack volume create --size 20 --image cirros --bootable vol-cirros-2'.
20190827 09:17:09.079 - INFO - Command exited with return code 1.
20190827 09:17:09.079 - INFO - ${stdout} =
20190827 09:17:09.079 - INFO - ${stderr} = Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port...
20190827 09:17:09.079 - INFO - ${rc} = 1
20190827 09:17:09.080 - INFO - +------ END KW: SSHLibrary.Execute Command (128592)
20190827 09:17:09.080 - INFO - +------ START KW: BuiltIn.Create Dictionary [ stdout=${stdout} | stderr=${stderr} | rc=${rc} ]
20190827 09:17:09.082 - INFO - ${res} = {u'stdout': u'', u'stderr': u"Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.open...
20190827 09:17:09.082 - INFO - +------ END KW: BuiltIn.Create Dictionary (2)
20190827 09:17:09.082 - INFO - +------ START KW: BuiltIn.Run Keyword If [ ${rc} != 0 and ${fail_if_error} == True | FAIL | ${stderr} ]
20190827 09:17:09.084 - INFO - +------- START KW: BuiltIn.Fail [ ${stderr} ]
20190827 09:17:09.084 - FAIL - Unable to establish connection to http://cinder-api.openstack.svc.cluster.local:8776/v2/7a413f8546d246059fd3e95f2bf1977f/volumes: HTTPConnectionPool(host='cinder-api.openstack.svc.cluster.local', port=8776): Max retries exceeded with url: /v2/7a413f8546d246059fd3e95f2bf1977f/volumes (Caused by NewConnectionErro...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on regression testing done for Train, this issue is very rare. As per Yang Liu, it possibly happens < 1% of the time. As such, there is no plan to fix this issue at this time.

Changed in starlingx:
status: In Progress → Won't Fix