Failed to upgrade storage node

Bug #1954695 reported by Vinicius Lopes da Silva
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Vinicius Lopes da Silva

Bug Description

Brief Description
-----------------
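The upgrade of a storage node failed. The sysinv API returned a server-side error because 'ceph restful list-keys --connect-timeout 5' exited with a non-zero status.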

Severity
--------
Major

Steps to Reproduce
------------------
1. Install a 2+2+2 system using STX.5.0.
2. Upgrade from STX.5.0 to STX.6.0.
3. After upgrading both controllers, lock and upgrade a storage node.

Expected Behavior
------------------
The storage node should be upgraded successfully.

Actual Behavior
----------------
The storage node upgrade fails with a server-side error from sysinv (see Timestamp/Logs).

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2021-12-06_23-00-09"

Last Pass
---------
N/A

Timestamp/Logs
--------------
sysinv 2021-12-08 01:10:29.476 907247 ERROR wsme.api [-] Server-side error: "Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1". Detail:
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 2695, in upgrade
    osd_status = self._ceph.check_osds_down_up(rpc_ihost.hostname, True)

  File "/usr/lib64/python2.7/site-packages/sysinv/common/ceph.py", line 707, in check_osds_down_up
    response, body = self._ceph_api.osd_tree(body='json')

  File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 2555, in osd_tree
    return self._request('osd tree', **kwargs)

  File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 190, in _request
    self._get_password()

  File "/usr/lib/python2.7/site-packages/cephclient/client.py", line 88, in _get_password
    shell=True)

  File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)

CalledProcessError: Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1
: CalledProcessError: Command 'ceph restful list-keys --connect-timeout 5' returned non-zero exit status 1
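
The traceback ends in subprocess.check_output raising CalledProcessError, which is how Python reports any command that exits with a non-zero status. The sketch below reproduces that behaviour in isolation. It is not the cephclient code: run_cli is a hypothetical helper, and a small Python one-liner stands in for the failing ceph command so the example runs without a cluster.

```python
import subprocess
import sys


def run_cli(cmd):
    """Run a command; return (ok, detail). Hypothetical helper for illustration."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return True, out.decode()
    except subprocess.CalledProcessError as exc:
        # check_output raises this whenever the exit status is non-zero.
        return False, "returned non-zero exit status %d" % exc.returncode


# Stand-in for a CLI call that fails, e.g. while the monitors are unavailable.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
ok, detail = run_cli(failing_cmd)
print(ok, detail)
```

Catching the exception at the call site, as here, is one way to turn a transient CLI failure into a condition the caller can handle instead of a server-side error.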

Test Activity
-------------
Regression Testing

Workaround
----------
Retry the upgrade on the same storage node.
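
The manual workaround amounts to repeating the request until the transient failure clears. A minimal sketch of that idea as a generic retry helper follows. The names retry and flaky_upgrade are hypothetical and only simulate a request that fails once; nothing here calls the real sysinv or ceph APIs.

```python
import time


def retry(operation, attempts=3, delay_s=0.0, retry_on=(Exception,)):
    """Call operation() until it succeeds or attempts are exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)  # wait before the next try
    raise last_error


calls = {"count": 0}


def flaky_upgrade():
    """Simulated request: fails on the first call, succeeds afterwards."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("monitor election in progress")
    return "upgrade accepted"


result = retry(flaky_upgrade, attempts=3, delay_s=0.0)
print(result, "after", calls["count"], "calls")
```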

Vinicius Lopes da Silva (viniciuslopesdasilva) wrote :

Logs attached

Vinicius Lopes da Silva (viniciuslopesdasilva) wrote :

The exception appears at 2021-12-08 01:10:29.476. Since the error surfaces after a 5-second timeout (ceph restful list-keys --connect-timeout 5), the command was probably issued at about 01:10:24.476. At that time, the monitor apparently wasn't answering requests because an election was in progress.

Trying again should allow the storage node to be upgraded.

As a fix, increasing the timeout for ceph CLI commands should be enough.
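
The timing estimate in this comment can be checked with a short calculation: subtract the 5-second connect timeout from the logged error time. This assumes, as the comment does, that the command ran for the full timeout before failing.

```python
from datetime import datetime, timedelta

# Timestamp of the sysinv error, taken from the log line in the description.
error_time = datetime.strptime("2021-12-08 01:10:29.476", "%Y-%m-%d %H:%M:%S.%f")

# Value passed to --connect-timeout in the failing command.
connect_timeout = timedelta(seconds=5)

issued_at = error_time - connect_timeout
issued_str = issued_at.strftime("%H:%M:%S.%f")[:-3]  # trim to milliseconds
print(issued_str)  # 01:10:24.476
```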

Bob Church (rchurch)
Changed in starlingx:
assignee: nobody → Vinicius Lopes da Silva (viniciuslopesdasilva)
Al Bailey (albailey1974) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (r/stx.6.0)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/821436
Committed: https://opendev.org/starlingx/utilities/commit/f45bd562d3b860bee54c6ff1db9688405c75c3c9
Submitter: "Zuul (22348)"
Branch: r/stx.6.0

commit f45bd562d3b860bee54c6ff1db9688405c75c3c9
Author: Vinicius Lopes da Silva <email address hidden>
Date: Thu Dec 9 16:57:51 2021 -0300

    Increase timeout value for ceph CLI

    When a monitor election is in process and ceph CLI commands are issued,
    they might hang until the end of the election occurs. Sometimes the
    election is quick and sometimes it might take up until 11s.
    Since current code awaits up until 5s, sometimes an error appears in
    STX due to timeout.

    The timeout value needs to increase, I believe 15s of timeout is enough
    for most cases.

    Testing performed:
    2+2:
    Set up two scripts making requests to ceph. One had a small timeout
    and another one a big timeout.
    Started both scripts.
    Locked storage-0.
    Script having small timeout showed error and the other one didn't.

    Story: 2009074
    Task: 44159
    Closes-Bug: #1954695

    Signed-off-by: Vinicius Lopes da Silva <email address hidden>
    Change-Id: I618ed74c8f63ca77caa0893016f90b67fdfee7e0
    (cherry picked from commit 41e65d98f1379d6a09fb01a7c9d6d863f68c5ab6)
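
The test described in the commit message (one client with a small timeout, one with a large timeout, both running while the monitors are busy) can be imitated without a cluster. In the sketch below a child process that sleeps stands in for a ceph command blocked by an election. The durations are scaled-down assumptions, not measured values, and request_with_timeout is a hypothetical helper.

```python
import subprocess
import sys


def request_with_timeout(busy_s, timeout_s):
    """Run a stand-in command that blocks for busy_s seconds.

    Returns "ok" if it finishes within timeout_s, otherwise "timeout".
    """
    cmd = [sys.executable, "-c", "import time; time.sleep(%f)" % busy_s]
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
        return "ok"
    except subprocess.TimeoutExpired:
        return "timeout"


ELECTION_S = 1.0  # assumed stand-in for the election duration

small = request_with_timeout(ELECTION_S, timeout_s=0.2)   # shorter than the election
big = request_with_timeout(ELECTION_S, timeout_s=10.0)    # longer than the election
print("small timeout:", small)
print("big timeout:", big)
```

As in the commit's test, only the client whose timeout is shorter than the blocking period reports an error.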

Ghada Khalil (gkhalil) wrote :

Marking as Fix Released. I see the reviews in both stx master and r/stx.6.0 are merged.

tags: added: in-r-stx60 stx.6.0 stx.storage
tags: added: stx.update
Changed in starlingx:
status: New → In Progress
importance: Undecided → High
status: In Progress → Fix Released