RPC call failure blocks host from being inventoried

Bug #1973244 reported by Heitor Matsui
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Heitor Matsui
Milestone: (none listed)

Bug Description

Brief Description
-----------------
A new RPC call introduced in https://review.opendev.org/c/starlingx/config/+/835311 blocks controller-1 from being inventoried after it is upgraded. At this point the conductor is still running load N, which does not have the new RPC call, so the call made by the upgraded agent fails with an AttributeError.
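
The failure mode can be illustrated with a minimal sketch of name-based RPC dispatch. It is not the sysinv dispatcher itself: `OldConductor`, `ping` and `dispatch` are hypothetical stand-ins, and only the error text is taken from the traceback in this report.

```python
class OldConductor(object):
    """Hypothetical stand-in for a load N conductor that predates get_isystem."""

    def ping(self, context):
        return "pong"


def dispatch(manager, context, method, **kwargs):
    """Look the requested method up by name, as a name-based dispatcher would."""
    if not hasattr(manager, method):
        # Same message format as the one shown in the log excerpt.
        raise AttributeError("No such RPC function '%s'" % method)
    return getattr(manager, method)(context, **kwargs)


# A load N+1 agent asks for a method the load N conductor does not have.
try:
    dispatch(OldConductor(), None, "get_isystem")
except AttributeError as exc:
    error_message = str(exc)
```

The point of the sketch is that the caller cannot know in advance whether the remote side implements the method; the mismatch only surfaces as an exception at call time.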

Severity
--------
Critical, blocks the upgrade from proceeding

Steps to Reproduce
------------------
Upgrade controller-1
Try to unlock controller-1
Check sysinv.log and observe RPC call error messages

Expected Behavior
------------------
Host controller-1 unlocks without errors

Actual Behavior
----------------
Host controller-1 is blocked from unlocking with "not yet inventoried" error

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO DX, Standard (DC and non-DC)

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
N/A

Timestamp/Logs
--------------
sysinv 2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager [-] Sysinv Agent exception creating the host filesystems. No such RPC function 'get_isystem'
Traceback (most recent call last):

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 437, in _process_data
    **args)

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 176, in dispatch
    raise AttributeError("No such RPC function '%s'" % method)

AttributeError: No such RPC function 'get_isystem'
: AttributeError: No such RPC function 'get_isystem'
Traceback (most recent call last):

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 437, in _process_data
    **args)

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 176, in dispatch
    raise AttributeError("No such RPC function '%s'" % method)

AttributeError: No such RPC function 'get_isystem'
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager Traceback (most recent call last):
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1164, in _create_host_filesystems
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager system = rpcapi.get_isystem(icontext)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 189, in get_isystem
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager return self.call(context, self.make_msg('get_isystem',))
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 121, in call
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager result = rpc.call(context, real_topic, msg, timeout)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/__init__.py", line 139, in call
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager return _get_impl().call(CONF, context, topic, msg, timeout)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 815, in call
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager rpc_amqp.get_connection_pool(conf, Connection))
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 619, in call
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager rv = list(rv)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 568, in __iter__
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager raise result # pylint: disable=raising-bad-type
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager AttributeError: No such RPC function 'get_isystem'
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager Traceback (most recent call last):
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/amqp.py", line 437, in _process_data
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager **args)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/dispatcher.py", line 176, in dispatch
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager raise AttributeError("No such RPC function '%s'" % method)
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager AttributeError: No such RPC function 'get_isystem'
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager
2022-05-12 11:03:12.870 35610 ERROR sysinv.agent.manager
sysinv 2022-05-12 11:03:12.900 35610 INFO sysinv.agent.manager [-] _conditions_for_inventory_complete_met requires set(['host_filesystems'])

Test Activity
-------------
Feature Testing

Workaround
----------
Comment out the lines introduced by the change referenced in the description, then restart sysinv-agent with: sudo systemctl restart sysinv-agent

Changed in starlingx:
status: New → In Progress
Heitor Matsui (heitormatsui) wrote :

This RPC call is new, so calling it from load N+1 against a conductor still running load N will fail. The solution is to catch this exception and let it pass during upgrades, since the error is expected in that window; other parts of the agent/manager.py code already treat it this way.
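
A rough sketch of that approach, assuming the error reaches the agent as a plain AttributeError (as the traceback in this report suggests). The helper `get_system_or_none` and the fake `_OldConductorAPI` are hypothetical names used only for illustration; this is not the merged fix.

```python
import logging

LOG = logging.getLogger(__name__)


def get_system_or_none(rpcapi, context):
    """Return the system record, or None if the conductor lacks the call."""
    try:
        return rpcapi.get_isystem(context)
    except AttributeError as exc:
        # Expected while the conductor still runs the older load: log and
        # continue instead of aborting the rest of the inventory step.
        LOG.warning("get_isystem not available on conductor: %s", exc)
        return None


class _OldConductorAPI(object):
    """Hypothetical stand-in that mimics the error seen in the log."""

    def get_isystem(self, context):
        raise AttributeError("No such RPC function 'get_isystem'")


result = get_system_or_none(_OldConductorAPI(), None)
```

Catching AttributeError this broadly can also hide genuine programming errors, so a real handler would probably want to limit the `try` block to the single RPC call, as done here.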

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/841468
Committed: https://opendev.org/starlingx/config/commit/712a33718392ac573c89aec189eb8ae473f4b080
Submitter: "Zuul (22348)"
Branch: master

commit 712a33718392ac573c89aec189eb8ae473f4b080
Author: Heitor Matsui <email address hidden>
Date: Wed May 11 17:27:33 2022 -0300

    Fix upgrade failure on agent manager

    During the upgrade, at the stage where controller-1 is upgraded
    but controller-0 is not, the controller-1 agent fails while trying
    to make a RPC call that does not exist on load N, introduced in
    the change [1].

    The commit allows these RPC call failures to pass during the
    upgrades. Also, this commit will make the agent first try
    reading the system_dc_role and system_type parameters from
    the local platform.conf file, using the RPC call only when
    it doesn't find the system_dc_role on the file.

    [1] https://review.opendev.org/c/starlingx/config/+/835311

    Test Plan:
    PASS: Upgrade AIO DX successfully
    PASS: Upgrade Standard successfully

    Closes-bug: 1973244
    Change-Id: I457b4c2ceea7bab13923d2a8840e97b82d39da2a
    Signed-off-by: Heitor Matsui <email address hidden>
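
The file-first lookup described in the commit message could look roughly like the sketch below. It assumes platform.conf holds simple key=value lines and that the keys are literally named system_dc_role and system_type as in the commit message; the actual key names and the functions `read_platform_conf` and `get_system_info` are assumptions, not the code that was merged.

```python
import os
import tempfile


def read_platform_conf(path):
    """Parse key=value lines into a dict; return {} if the file is unreadable."""
    values = {}
    try:
        with open(path) as conf:
            for line in conf:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    except IOError:
        pass
    return values


def get_system_info(conf_path, rpc_fallback):
    """Prefer the local file; call rpc_fallback only if system_dc_role is absent."""
    values = read_platform_conf(conf_path)
    if "system_dc_role" in values:
        return {
            "system_dc_role": values["system_dc_role"],
            "system_type": values.get("system_type"),
        }
    return rpc_fallback()


# Demo with a temporary file standing in for platform.conf.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write("system_type=All-in-one\nsystem_dc_role=subcloud\n")
    demo_path = tmp.name


def fake_rpc():
    # Hypothetical fallback; in the agent this would be the conductor RPC call.
    return {"system_dc_role": None, "system_type": "Standard"}


from_file = get_system_info(demo_path, fake_rpc)
os.remove(demo_path)
from_rpc = get_system_info(demo_path, fake_rpc)  # file gone, so fallback is used
```

Reading the local file first means the agent does not depend on the conductor's RPC surface for values it can already find on disk, which narrows the window in which a load mismatch matters.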

Changed in starlingx:
status: In Progress → Fix Released
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.config