SSL CA certificate installation occasionally stalls or times out

Bug #1874523 reported by Tee Ngo
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Yuxing

Bug Description

Brief Description
-----------------
SSL CA certificate installation during bootstrap occasionally stalls or times out. When this occurs during the bootstrap of a subcloud, the subcloud has to be deleted and re-added.

Severity
--------
Minor

Steps to Reproduce
------------------
Add a subcloud

Expected Behavior
------------------
The subcloud is bootstrapped and deployed successfully

Actual Behavior
----------------
Occasionally a subcloud fails to bootstrap because the SSL CA certificate installation times out:

TASK [bootstrap/persist-config : Ensure docker and containerd config directory exist] ***

TASK [bootstrap/persist-config : Ensure docker and containerd proxy config exist] ***

TASK [bootstrap/persist-config : Write header to docker and containerd proxy conf files] ***

TASK [bootstrap/persist-config : Add http proxy URL to docker and containerd proxy conf files] ***

TASK [bootstrap/persist-config : Add https proxy URL to docker and containerd proxy conf files] ***

TASK [bootstrap/persist-config : Add no proxy address list to docker proxy config file] ***

TASK [bootstrap/persist-config : Add no proxy address list to containerd proxy config file] ***

TASK [bootstrap/persist-config : Restart Docker and containerd] ****************

TASK [bootstrap/persist-config : Copy ssl_ca certificate] **********************
changed: [subcloud11]

TASK [bootstrap/persist-config : Remove ssl_ca complete flag] ******************
changed: [subcloud11]

TASK [bootstrap/persist-config : Add ssl_ca certificate] ***********************
changed: [subcloud11]

TASK [bootstrap/persist-config : Wait for certificate install] *****************
fatal: [subcloud11]: FAILED! => {"changed": false, "elapsed": 360, "msg": "Timeout waiting for ssl_ca certificate install"}

PLAY RECAP *********************************************************************
subcloud11 : ok=169 changed=65 unreachable=0 failed=1
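
For readers unfamiliar with this kind of step: the task names above ("Remove ssl_ca complete flag", then "Wait for certificate install") suggest that the playbook clears a completion flag, requests the install, and then polls for the flag until a 360-second limit is reached. The sketch below only illustrates that poll-with-timeout pattern. The function name, the polling interval, and the use of a file as the flag are assumptions, not the actual playbook implementation.

```python
import os
import time


def wait_for_flag(path, timeout=360.0, interval=5.0):
    """Poll until `path` exists.

    Returns the elapsed seconds on success and raises TimeoutError once
    `timeout` seconds have passed. `path` stands in for a hypothetical
    "install complete" flag file; the real flag location is not given
    in this report.
    """
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            raise TimeoutError(
                f"Timeout waiting for {path} after {elapsed:.0f}s"
            )
        # Never sleep past the deadline.
        time.sleep(min(interval, timeout - elapsed))
```

With a pattern like this, a delay anywhere between the install request and the creation of the flag shows up only as a timeout in the waiting task, which matches the failure seen in the output above.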

Reproducibility
---------------
Stalling was seen a couple of times during a subcloud deployment. Timeout was seen once.

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
April 20th master build

Last Pass
---------
This is an intermittent issue.

Timestamp/Logs
--------------
See above for a sample

Unfortunately, the VM running the affected subcloud was taken down and relaunched, so the subcloud log is not available. The timeout failures were also seen a few times by a test team member (Yosief Gebremariam) who was deploying subclouds in his lab. Subcloud logs will be uploaded the next time the issue occurs. In the meantime, I'd suggest profiling the SSL CA certificate installation task during bootstrap. Vigorous bootstrap replay tests may also trigger the issue.

Test Activity
-------------
Evaluation

Workaround
----------
Delete the failed subcloud, then re-add it

Ghada Khalil (gkhalil) wrote :

From David Sullivan:
It looks like there was an issue with RabbitMQ for about 8-9 minutes. Oddly, the agent was able to connect but the sysinv API was not.

Once that cleared, the message was received by the conductor and the operation completed about 9 seconds later.

sysinv 2020-04-23 01:03:38.517 101545 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on localhost:5672
sysinv 2020-04-23 01:03:43.526 101545 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: timed out. Trying again in 13 seconds.: timeout: timed out
=INFO REPORT==== 23-Apr-2020::01:03:38 ===
accepting AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672)

=ERROR REPORT==== 23-Apr-2020::01:03:48 ===
closing AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
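
The RabbitMQ entries above show the broker accepting the connection and then closing it with a handshake timeout. When more logs become available, the gap between such events can be read straight from the report headers. The following is a minimal sketch, assuming the header format shown in the excerpt. The helper name `report_times` is made up for illustration, and the `%b` month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches RabbitMQ report headers such as:
# =INFO REPORT==== 23-Apr-2020::01:03:38 ===
REPORT_RE = re.compile(
    r"=(\w+) REPORT==== (\d{1,2}-\w{3}-\d{4}::\d{2}:\d{2}:\d{2}) ==="
)


def report_times(log_text):
    """Return (level, timestamp) pairs for each report header found."""
    return [
        (level, datetime.strptime(stamp, "%d-%b-%Y::%H:%M:%S"))
        for level, stamp in REPORT_RE.findall(log_text)
    ]


# The two broker entries quoted in this comment.
sample = """\
=INFO REPORT==== 23-Apr-2020::01:03:38 ===
accepting AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672)

=ERROR REPORT==== 23-Apr-2020::01:03:48 ===
closing AMQP connection <0.2798.0> (127.0.0.1:40690 -> 127.0.0.1:5672):
{handshake_timeout,frame_header}
"""

(_, accepted), (_, closed) = report_times(sample)
print((closed - accepted).total_seconds())  # 10.0
```

For the excerpt above, the connection was closed 10 seconds after it was accepted. Running the same extraction over a full broker log would show how long the handshake failures persisted.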

tags: added: stx.4.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Yuxing (yuxing)
Ghada Khalil (gkhalil) wrote :

As per Dariush Eslimi (flock/DC PL), this is a rare issue, so it's recommended to move to stx.5.0

tags: added: stx.5.0
removed: stx.4.0
Yuxing (yuxing) wrote :

This is an intermittent issue. A complete set of logs from the system controller (and from the subcloud controller, if possible) is needed when this issue happens again, in order to analyze the cause and determine the reproduction steps.

Ghada Khalil (gkhalil) wrote :

This appears to have been a single occurrence which has not been reported again since April 2020. Based on the notes above, it appears that there aren't enough logs to determine what happened. As per agreement with Dariush Eslimi (config PL), we're closing this. A new LP should be opened with a full set of logs if this is seen again.

Changed in starlingx:
status: Triaged → Invalid
Ghada Khalil (gkhalil) wrote :

Closing as the issue is not reproducible
