StarlingX

Failed to upgrade storage node

Bug #1954695 reported by Vinicius Lopes da Silva on 2021-12-13

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
StarlingX	Fix Released	High	Vinicius Lopes da Silva

Bug Description

Brief Description
-----------------
The upgrade of a storage node failed with a sysinv server-side error raised by the command 'ceph restful list-keys --connect-timeout 5'.

Severity
--------
Major

Steps to Reproduce
------------------
Install 2+2+2 using STX.5.0
Upgrade from STX.5.0 to STX.6.0
After upgrading both controllers, lock and upgrade a storage node.

Expected Behavior
------------------
The storage node should upgrade successfully.

Actual Behavior
----------------
The storage node upgrade fails with the error shown under Timestamp/Logs.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2021-12-06_23-00-09"

Last Pass
---------
N/A

Timestamp/Logs
--------------
sysinv 2021-12-08 01:10:29.476 907247 ERROR wsme.api [-] Server-side error: "Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1". Detail:
Traceback (most recent call last):

File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
result = f(self, *args, **kwargs)

File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 2695, in upgrade
osd_status = self._ceph.check_osds_down_up(rpc_ihost.hostname, True)

File "/usr/lib64/python2.7/site-packages/sysinv/common/ceph.py", line 707, in check_osds_down_up
response, body = self._ceph_api.osd_tree(body='json')

File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 2555, in osd_tree
return self._request('osd tree', **kwargs)

File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 190, in _request
self._get_password()

File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 88, in _get_password
shell=True)

File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)

CalledProcessError: Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1
: CalledProcessError: Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1

Test Activity
-------------
Regression Testing

Workaround
----------
Retry the upgrade on the same storage node.
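
The retry idea can also be applied in code around the failing call. The following is a minimal sketch, not the sysinv implementation: the helper name retry and its parameters (attempts, delay_s, sleep) are hypothetical, and it retries only the CalledProcessError seen in the traceback above.

```python
import subprocess
import time


def retry(action, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Only subprocess.CalledProcessError is retried, since that is the
    exception in the traceback; any other exception propagates at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(delay_s)  # give a monitor election time to finish
```

Passing the sleep function as a parameter keeps the helper testable without real delays; a caller would wrap whatever operation issues the ceph command.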

Tags:

Revision history for this message

Vinicius Lopes da Silva (viniciuslopesdasilva) wrote on 2021-12-13:

logs.zip (271.8 KiB, application/zip)

Logs attached

Vinicius Lopes da Silva (viniciuslopesdasilva) wrote on 2021-12-13:

The exception was logged at 2021-12-08 01:10:29.476. Since the error appears after the 5-second connect timeout (ceph restful list-keys --connect-timeout 5), the command was probably issued at about 01:10:24.476. At that time, the monitor apparently was not answering requests because an election was in progress.

Retrying should allow the storage node to be upgraded.

As a fix, increasing the timeout for ceph CLI commands should be enough.
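
As an illustration of that proposal, here is a minimal sketch of a call with a configurable connect timeout. It is not the cephclient code: the names list_keys_command, list_keys, DEFAULT_CONNECT_TIMEOUT_S and the runner parameter are hypothetical, and the default of 15 seconds is the value suggested in the commit message quoted further down this page. Only the command string and the --connect-timeout flag come from the traceback.

```python
import subprocess

# Assumption: 15 s, the value proposed in the merged commit message.
DEFAULT_CONNECT_TIMEOUT_S = 15


def list_keys_command(connect_timeout_s=DEFAULT_CONNECT_TIMEOUT_S):
    """Build the argument list for `ceph restful list-keys`."""
    if connect_timeout_s <= 0:
        raise ValueError("connect_timeout_s must be positive")
    return ["ceph", "restful", "list-keys",
            "--connect-timeout", str(connect_timeout_s)]


def list_keys(connect_timeout_s=DEFAULT_CONNECT_TIMEOUT_S,
              runner=subprocess.check_output):
    """Run the command and return its raw output.

    `runner` is injectable so the function can be exercised without a
    Ceph cluster; the default raises CalledProcessError on a non-zero
    exit status, as in the traceback.
    """
    return runner(list_keys_command(connect_timeout_s))
```

Passing an argument list instead of a shell string avoids shell=True, which is a design choice of this sketch and not something the original code does.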

Bob Church (rchurch) on 2021-12-13

Changed in starlingx:
assignee:	nobody → Vinicius Lopes da Silva (viniciuslopesdasilva)

Al Bailey (albailey1974) wrote on 2021-12-14:

https://review.opendev.org/c/starlingx/utilities/+/821436

OpenStack Infra (hudson-openstack) wrote on 2021-12-14: Fix merged to utilities (r/stx.6.0)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/821436
Committed: https://opendev.org/starlingx/utilities/commit/f45bd562d3b860bee54c6ff1db9688405c75c3c9
Submitter: "Zuul (22348)"
Branch: r/stx.6.0

commit f45bd562d3b860bee54c6ff1db9688405c75c3c9
Author: Vinicius Lopes da Silva <email address hidden>
Date: Thu Dec 9 16:57:51 2021 -0300

    Increase timeout value for ceph CLI

    When a monitor election is in process and ceph CLI commands are issued,
    they might hang until the end of the election occurs. Sometimes the
    election is quick and sometimes it might take up until 11s.
    Since current code awaits up until 5s, sometimes an error appears in
    STX due to timeout.

    The timeout value needs to increase, I believe 15s of timeout is enough
    for most cases.

    Testing performed:
    2+2:
    Set up two scripts making requests to ceph. One had a small timeout
    and another one a big timeout.
    Started both scripts.
    Locked storage-0.
    Script having small timeout showed error and the other one didn't.

    Story: 2009074
    Task: 44159
    Closes-Bug: #1954695

    Signed-off-by: Vinicius Lopes da Silva <email address hidden>
    Change-Id: I618ed74c8f63ca77caa0893016f90b67fdfee7e0
    (cherry picked from commit 41e65d98f1379d6a09fb01a7c9d6d863f68c5ab6)
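
The reasoning behind the two-script test in the commit message can be modelled with a small simulation. This is a simplified sketch under stated assumptions: a command is assumed to return as soon as the election ends, the function name timed_out_hangs is hypothetical, and the sample hang durations are invented for illustration except for the 11 s worst case quoted in the commit message.

```python
def timed_out_hangs(hang_durations_s, timeout_s):
    """Return the hang durations that would exceed the given timeout.

    Simplified model: a command blocked by a monitor election returns
    when the election ends, so it fails only if the hang lasts at
    least as long as the timeout.
    """
    return [hang for hang in hang_durations_s if hang >= timeout_s]


# Hypothetical election hang durations in seconds; only 11 is from the
# commit message.
sample_hangs = [0.5, 3, 7, 11]

for timeout in (5, 15):
    failures = timed_out_hangs(sample_hangs, timeout)
    print("timeout=%ss failures=%s" % (timeout, failures))
```

Under this model, the 5 s timeout fails on the two longer hangs while the 15 s timeout tolerates all of them, which matches the outcome reported for the two scripts.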

Ghada Khalil (gkhalil) wrote on 2021-12-14 (last edit on 2021-12-14):

Marking as Fix Released. I see the reviews in both stx master and r/stx.6.0 are merged.

tags:	added: in-r-stx60 stx.6.0 stx.storage
tags:	added: stx.update
Changed in starlingx:
status:	New → In Progress
importance:	Undecided → High
status:	In Progress → Fix Released