RPC timeout error when creating barbican secret on host-bulk-add after controller-0 is up

Bug #1839665 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Alexander Kozyrev

Bug Description

Brief Description
-----------------
After controller-0 is installed and configured, running 'system host-bulk-add <xmlfile>' fails due to an RPC timeout.

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure controller-0 using Ansible
- Unlock controller-0 and wait until it is available and its pods are running
- Run 'system host-bulk-add' for the remaining hosts (see the example XML below)

TC-name: system install
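
For reference, the hosts_bulk_add.xml file takes roughly the following form. This is a hypothetical sketch with made-up values, not a file from this report; note the bm_password field, which is what sysinv stores in Barbican via create_barbican_secret:

    <?xml version="1.0" encoding="UTF-8" ?>
    <hosts>
      <host>
        <hostname>controller-1</hostname>
        <personality>controller</personality>
        <mgmt_mac>08:00:27:aa:bb:01</mgmt_mac>
        <bm_type>bmc</bm_type>
        <bm_ip>192.168.100.11</bm_ip>
        <bm_username>admin</bm_username>
        <bm_password>password</bm_password>
      </host>
    </hosts>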

Expected Behavior
------------------
- hosts successfully added

Actual Behavior
----------------
[2019-08-09 14:06:49,438] 301 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-bulk-add hosts_bulk_add.xml'
[2019-08-09 14:08:51,518] 423 DEBUG MainThread ssh.expect :: Output:
Error:
 controller: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "create_barbican_secret" info: "<unknown>"
 controller: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "create_barbican_secret" info: "<unknown>"

Reproducibility
---------------
Reproducible
This is seen on multiple systems with the 0808 and later loads, and was not seen with the 0805 load.
To work around it, rerun this command after some time.

System Configuration
--------------------
AIO-DX, Standard, Dedicated storage systems

Branch/Pull Time/Commit
-----------------------
We see this issue on both the master and stx.2.0 branches.
stx master as of 20190809
stx.2.0 as of 20190808

Last Pass
---------
20190805

Timestamp/Logs
--------------
# controller-0 is up and pods are running
[2019-08-09 14:06:46,730] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod -n=kube-system -o=wide'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-64d7b76bbd-xwqn2 1/1 Running 1 44m 172.16.192.68 controller-0 <none> <none>
calico-node-grtkc 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>
coredns-5c4849b47c-k9vqm 1/1 Running 1 44m 172.16.192.67 controller-0 <none> <none>
coredns-5c4849b47c-lrrjv 0/1 Pending 0 44m <none> <none> <none> <none>
kube-apiserver-controller-0 1/1 Running 3 43m 192.168.204.3 controller-0 <none> <none>
kube-controller-manager-controller-0 1/1 Running 1 43m 192.168.204.3 controller-0 <none> <none>
kube-multus-ds-amd64-nhvzv 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>
kube-proxy-6sxg6 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>
kube-scheduler-controller-0 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-qm6hl 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>
tiller-deploy-9c6df7f79-2j862 1/1 Running 1 44m 192.168.204.3 controller-0 <none> <none>

# cmd failed
[2019-08-09 14:06:49,438] 301 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-bulk-add hosts_bulk_add.xml'
[2019-08-09 14:08:51,518] 423 DEBUG MainThread ssh.expect :: Output:
Error:
 controller: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "create_barbican_secret" info: "<unknown>"
 controller: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "create_barbican_secret" info: "<unknown>"

Test Activity
-------------
Sanity

Revision history for this message
John Kung (john-kung) wrote :

This issue could be related to the change from https://bugs.launchpad.net/starlingx/+bug/1834673; at the least, the return-code handling added for bug 1834673 did not cover the RPC timeout case.

Revert request for suspect update is in Gerrit: https://review.opendev.org/#/c/675698/

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi John,

An RPC timeout causing the command to fail should be the expected behavior, right?
My patch explicitly reports a command error and lets the user try again later.
If we revert this patch, we simply ignore the timeout issue.

For the timeout case, we need to handle it in the bulk_add functions, which is
where we catch the timeout exception.

So this is a separate issue that has existed all along.
We had better merge my original patch and add a new patch to fix the timeout case, rather than reverting my patch, right?

Thanks!
Zhipeng
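
For illustration, a minimal sketch of the handling being discussed; the helper name, call signature, and exception type are assumptions, not the actual sysinv code:

    # Hypothetical sketch -- not the actual sysinv code. Assumes an
    # oslo.messaging-style RPC client and a wsme-based REST API.
    import oslo_messaging
    from wsme.exc import ClientSideError

    def _store_bm_password(rpcapi, context, host_uuid, bm_password):
        try:
            # The RPC call that timed out in this bug:
            rpcapi.create_barbican_secret(context, host_uuid, bm_password)
        except oslo_messaging.MessagingTimeout:
            # Surface a clear, retryable error for each host in the
            # bulk-add file instead of a bare RPC timeout.
            raise ClientSideError(
                "Timed out storing the BMC password; the conductor may "
                "still be initializing. Retry host-bulk-add shortly.")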

Revision history for this message
John Kung (john-kung) wrote :

Hi Zhipeng, the errors/timeouts on Barbican BMC password storage need to be handled. I would recommend resubmitting the review with the bulk_add function updated, so that the BMC password timeout/error cases can be reviewed holistically.
Thanks, John

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
importance: Undecided → High
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: High → Undecided
Revision history for this message
John Kung (john-kung) wrote :

I reverted https://bugs.launchpad.net/starlingx/+bug/1834673 because it was suspected of triggering this issue, and the commit did not handle the BMC password retrieval timeout case from Barbican. I suspect the RPC timeout would still occur even with the update reverted.

Investigation is required into the reason for the RPC timeout on rpcapi.create_barbican_secret(), along with better handling of the timeout condition so that an error code is returned to the client.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: John Kung (john-kung) → Alex Kozyrev (akozyrev)
Ghada Khalil (gkhalil)
summary: - RPC timeout error when bulk-add-hosts after controller-0 is up
+ RPC timeout error when creating barbican secret on host-bulk-add after
+ controller-0 is up
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority given that re-running the command goes through. Still, this should be investigated and the root cause determined and addressed.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Ghada Khalil (gkhalil)
tags: added: stx.3.0 stx.config
Revision history for this message
Alexander Kozyrev (akozyrev) wrote :

sysinv-conductor is stuck for 2 minutes during startup and is unable to handle any RPC requests.
The refresh_helm_repo_information() function executes "sudo -u sysadmin helm repo update" and waits for it to complete.
Helm tries to reach the external https://kubernetes-charts.storage.googleapis.com and times out after 2 minutes.
Any command issued in this window is expected to fail: sysinv-api is up, so commands can be submitted, but the conductor cannot respond.
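
In other words, the conductor's startup path blocks on a synchronous shell-out, roughly like this (a simplified illustration, not the actual sysinv-conductor source):

    # Simplified illustration of the blocking pattern described above;
    # not the actual sysinv-conductor source.
    import subprocess

    def refresh_helm_repo_information():
        # Runs synchronously during conductor startup. If the external
        # chart repo is unreachable, this blocks for the full network
        # timeout (~2 minutes), and no RPC request -- including
        # create_barbican_secret -- is serviced until it returns.
        subprocess.check_call(
            ['sudo', '-u', 'sysadmin', 'helm', 'repo', 'update'])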

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Alex,

Any further proposal for this ticket?

Thanks!
Zhipeng

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681805

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/681805
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=783ed056966344b4c3a715bae470a149798957d1
Submitter: Zuul
Branch: master

commit 783ed056966344b4c3a715bae470a149798957d1
Author: Alex Kozyrev <email address hidden>
Date: Thu Sep 12 12:07:54 2019 -0400

    Move Helm chart repo refresh from Sysinv conductor to agent

    sysinv-conductor is stuck for 2 minutes during startup and unable to
    handle any RPC requests. The refresh_helm_repo_information() function
    executes “sudo -u sysadmin helm repo update” and waits for it.
    Helm tries to reach the external repo and times out after 2 minutes.
    Any command (since sysinv-api is up and accepts them) is expected
    to fail during this time window. Unblock the conductor by moving
    the actual "helm repo update" invocation to the sysinv agent. Let
    the conductor make an RPC call instead, and remove the helm refresh
    from conductor init; it will be called as part of application upload.

    Change-Id: I29249146eaaebfe51a0f67e3ffaae9ce48784107
    Closes-Bug: 1839665
    Signed-off-by: Alex Kozyrev <email address hidden>
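
The shape of the fix, per the commit message above, is roughly the following. This is an illustrative sketch, not the merged code; the class and attribute names are assumptions:

    # Illustrative sketch of the fix's shape; class and attribute
    # names here are assumptions, not the merged change.
    class ConductorManager(object):
        def refresh_helm_repo_information(self, context):
            # Delegate the blocking 'helm repo update' to the sysinv
            # agent over RPC, so the conductor never sits waiting on
            # an external network timeout while it should be servicing
            # RPC requests such as create_barbican_secret.
            self._agent_rpcapi.refresh_helm_repo_information(context)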

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

Have not seen this issue in daily sanity for the past 2 weeks.

tags: removed: stx.retestneeded