SSL CA certificate installation occasionally stalls or times out
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Invalid | Medium | Yuxing |
Bug Description
Brief Description
-----------------
SSL CA certificate installation during bootstrap occasionally stalls or times out. When this occurs during the bootstrap of a subcloud, the subcloud has to be deleted and re-added.
Severity
--------
Minor
Steps to Reproduce
------------------
Add a subcloud
Expected Behavior
------------------
The subcloud is bootstrapped and deployed successfully
Actual Behavior
----------------
Occasionally a subcloud fails to bootstrap because the SSL CA certificate installation times out:
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
TASK [bootstrap/
changed: [subcloud11]
TASK [bootstrap/
changed: [subcloud11]
TASK [bootstrap/
changed: [subcloud11]
TASK [bootstrap/
fatal: [subcloud11]: FAILED! => {"changed": false, "elapsed": 360, "msg": "Timeout waiting for ssl_ca certificate install"}
PLAY RECAP *******
subcloud11 : ok=169 changed=65 unreachable=0 failed=1
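Because the failure is intermittent, it may help to collect every occurrence from saved playbook output. The following is a minimal sketch, assuming the `fatal:` line has the same layout as the sample above; the helper names `parse_fatal_line` and `is_ssl_ca_timeout` are hypothetical and not part of the bootstrap playbook.

```python
import json
import re

# Matches an Ansible failure line shaped like the one in this report:
#   fatal: [<host>]: FAILED! => {<json payload>}
FATAL_RE = re.compile(
    r'^fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})$'
)


def parse_fatal_line(line):
    """Return host, elapsed seconds and message, or None if not a fatal line."""
    match = FATAL_RE.match(line.strip())
    if match is None:
        return None
    payload = json.loads(match.group("payload"))
    return {
        "host": match.group("host"),
        "elapsed": payload.get("elapsed"),
        "msg": payload.get("msg", ""),
    }


def is_ssl_ca_timeout(record):
    """True when the parsed record is the ssl_ca install timeout."""
    return record is not None and "ssl_ca certificate install" in record["msg"]


# Sample line copied from the playbook output in this report.
sample = (
    'fatal: [subcloud11]: FAILED! => {"changed": false, "elapsed": 360, '
    '"msg": "Timeout waiting for ssl_ca certificate install"}'
)
record = parse_fatal_line(sample)
print(record["host"], record["elapsed"], is_ssl_ca_timeout(record))
```

Running this over archived bootstrap output would give a rough count of how often the timeout occurs and on which subclouds.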
Reproducibility
---------------
Stalling was seen a couple of times during a subcloud deployment. Timeout was seen once.
System Configuration
-------
IPv6 distributed cloud
Branch/Pull Time/Commit
-------
April 20th master build
Last Pass
---------
This is an intermittent issue.
Timestamp/Logs
--------------
See above for a sample.
Unfortunately, the VM running the affected subcloud was taken down and relaunched, so the subcloud log is not available. The timeout failures were also seen a few times by a test team member (Yosief Gebremariam) who was deploying subclouds in his lab. Subcloud logs will be uploaded the next time the issue occurs. In the meantime, I'd suggest profiling the SSL CA certificate installation task during bootstrap. Vigorous bootstrap replay tests may also trigger the issue.
Test Activity
-------------
Evaluation
Workaround
----------
Delete the failed subcloud then re-add
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Yuxing (yuxing)
From David Sullivan:
Looks like there was an issue with RabbitMQ for about 8-9 minutes. Oddly, it looks like the agent was able to connect but the sysinv API was not.
Once that cleared, the message was received by the conductor and the operation completed about 9s later.
sysinv 2020-04-23 01:03:38.517 101545 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
sysinv 2020-04-23 01:03:43.526 101545 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: timed out. Trying again in 13 seconds.: timeout: timed out
=INFO REPORT==== 23-Apr-2020::01:03:38 ===
accepting AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672)
=ERROR REPORT==== 23-Apr-2020::01:03:48 ===
closing AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
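To check how long sysinv was unable to reach the AMQP server in a given log, the reconnect messages can be extracted and timed. This is a rough sketch, assuming sysinv lines follow the `sysinv <date> <time> <pid> <LEVEL> <module> ...` layout of the excerpt above. The sample lines are reassembled from that excerpt, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "sysinv YYYY-MM-DD HH:MM:SS.mmm <pid> <LEVEL> <rest>"
LINE_RE = re.compile(
    r'^sysinv (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) '
    r'\d+ (?P<level>[A-Z]+) (?P<rest>.*)$'
)


def amqp_events(lines):
    """Collect (timestamp, level, text) for sysinv lines mentioning the AMQP server."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and "AMQP server" in match.group("rest"):
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("level"), match.group("rest")))
    return events


def trouble_window_seconds(events):
    """Seconds between the first and last AMQP-related message (0.0 if none)."""
    if not events:
        return 0.0
    return (events[-1][0] - events[0][0]).total_seconds()


# Sample lines reassembled from the excerpt in this report (an assumption).
SAMPLE_LINES = [
    "sysinv 2020-04-23 01:03:38.517 101545 INFO "
    "sysinv.openstack.common.rpc.common [-] "
    "Reconnecting to AMQP server on localhost:5672",
    "sysinv 2020-04-23 01:03:43.526 101545 ERROR "
    "sysinv.openstack.common.rpc.common [-] "
    "AMQP server on localhost:5672 is unreachable: timed out. "
    "Trying again in 13 seconds.: timeout: timed out",
    "accepting AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672)",
]

events = amqp_events(SAMPLE_LINES)
window = trouble_window_seconds(events)
print(len(events), "AMQP messages spanning", window, "seconds")
```

Applied to a full sysinv log rather than this two-line sample, the window should approximate the 8-9 minute outage described in the comment, and can be compared against the 360-second timeout reported by the playbook.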