NFV: Failed openstack API calls being quietly ignored in python3

Bug #2007285 reported by Al Bailey
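+-----------+--------------+------------+-------------+-----------+
| Affects   | Status       | Importance | Assigned to | Milestone |
+-----------+--------------+------------+-------------+-----------+
| StarlingX | Fix Released | Medium     | Al Bailey   |           |
+-----------+--------------+------------+-------------+-----------+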

Bug Description

Brief Description
-----------------
When NFV (running under python3) encounters an OpenStackRestAPIException, the exception is quietly ignored and the operation eventually reports a timeout.

This affects traceability and error handling for NFV orchestration activities.

The easiest way to reproduce the issue is to attempt a kube upgrade to an 'older' version.

For example, going from v1.24.4 back to v1.21.8:
system kube-version-list
+---------+--------+-------------+
| version | target | state       |
+---------+--------+-------------+
| v1.21.8 | False  | unavailable |
| v1.22.5 | False  | unavailable |
| v1.23.1 | False  | unavailable |
| v1.24.4 | True   | active      |
+---------+--------+-------------+

Severity
--------
Major

Steps to Reproduce
------------------
# assume already at a later version than v1.21.8
source /etc/platform/openrc
sw-manager kube-upgrade-strategy create --to-version v1.21.8
sw-manager kube-upgrade-strategy apply

Expected Behavior
------------------
It should quickly report a failure that states the reason:
sw-manager kube-upgrade-strategy show
Strategy Kubernetes Upgrade Strategy:
  strategy-uuid: e30abd1f-96ab-49d8-8ee0-b551df41adfd
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: serial
  default-instance-action: stop-start
  alarm-restrictions: strict
  current-phase: abort
  current-phase-completion: 100%
  state: aborted
  apply-result: failed
  apply-reason: the installed kubernetes version v1.24.4 cannot upgrade to version v1.21.8
  abort-result: success
  abort-reason:

Actual Behavior
----------------
It takes a couple of minutes and then reports a failure whose only reason is 'timed out'.

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
Any load running python3

Last Pass
---------
Any load running python2

Timestamp/Logs
--------------
Logs are not useful for this activity. The underlying component that causes the problem quietly discards the action, so nothing is logged. I spent weeks before I finally found where the 'stall' was occurring.
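
The following is a minimal sketch of why a discarded result shows up as a timeout rather than as an error. It is not the NFV code: `UnpicklableResult`, `send_result` and `wait_for_result` are hypothetical names, and a plain `queue.Queue` stands in for the real queue/socket. It only assumes that a serialization error on the sending side is swallowed.

```python
import pickle
import queue


class UnpicklableResult(Exception):
    """Hypothetical task result whose pickling always fails."""

    def __reduce__(self):
        # Stand-in for any error raised while the result is being pickled.
        raise AttributeError("message")


def send_result(result_queue, result):
    """Hypothetical sender that swallows serialization errors."""
    try:
        result_queue.put(pickle.dumps(result))
    except Exception:
        # The failure is dropped here: nothing is logged and
        # nothing is placed on the queue.
        pass


def wait_for_result(result_queue, timeout_secs):
    """Hypothetical receiver that can only observe a timeout."""
    try:
        return pickle.loads(result_queue.get(timeout=timeout_secs))
    except queue.Empty:
        return "timed out"


if __name__ == "__main__":
    q = queue.Queue()
    send_result(q, UnpicklableResult())
    print(wait_for_result(q, 0.1))
```

Under these assumptions the receiver never learns that a result existed, which would be consistent with the strategy reporting 'timed out' and with nothing useful appearing in the logs.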

Test Activity
-------------
Feature Testing

Workaround
----------
None

Al Bailey (albailey1974)
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/873724

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/873724
Committed: https://opendev.org/starlingx/nfv/commit/94321e9d571922a453917e80cebd0835b9bf7e40
Submitter: "Zuul (22348)"
Branch: master

commit 94321e9d571922a453917e80cebd0835b9bf7e40
Author: Al Bailey <email address hidden>
Date: Tue Feb 14 15:21:13 2023 +0000

    Debian: python3 fix for OpenStackRestAPIExceptions

    When the NFV uses tasks and futures and coroutines to
    interact with openstack APIs, an OpenStackRestAPIException
    can be returned as a task result.

    The exception needs to be 'pickled' when sent across the
    queue/socket for the 'simulated' asyncio workflow.

    However, the pickle code for that exception was broken in
    python3. It was relying on a python2 'message' attribute
    of the base Exception class to exist, which no longer
    exists (in python3)

    This was causing the pickle command to quietly fail and
    the code waiting for the task result would timeout and
    not report back the failure information.

    The fix is to ensure that there is a 'message' property
    on that exception type.

    Unit tests have been added for all the pickleable
    exceptions, to ensure their '__reduce__' and other
    interactions with 'pickle' are not reporting any failures.

    Test Plan:
     PASS: create and apply a kube-upgrade-strategy for an
     older version of kubernetes and observe it reports its
     failure error (rather than a timeout)

    Closes-Bug: #2007285
    Signed-off-by: Al Bailey <email address hidden>
    Change-Id: I3a8776163a78330810ae1097ddd1831b1b26a212
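
The failure mode described in the commit message can be reproduced in isolation. The sketch below is illustrative only and is not the code from the NFV repository: `BrokenApiError` and `FixedApiError` are hypothetical stand-ins for the real exception, and their constructor arguments are assumptions. It relies on one fact about Python 3: exception instances no longer have a built-in `message` attribute, so a `__reduce__` that reads `self.message` raises `AttributeError` while pickling.

```python
import pickle


class BrokenApiError(Exception):
    """Hypothetical exception whose __reduce__ relies on a python2-only attribute."""

    def __init__(self, status_code, reason):
        super().__init__(reason)
        self.status_code = status_code
        self.reason = reason

    def __reduce__(self):
        # 'message' is not defined on python3 exceptions, so this
        # attribute lookup raises AttributeError during pickling.
        return (BrokenApiError, (self.status_code, self.message))


class FixedApiError(Exception):
    """Hypothetical fixed variant that defines 'message' explicitly."""

    def __init__(self, status_code, reason):
        super().__init__(reason)
        self.status_code = status_code
        self.reason = reason

    @property
    def message(self):
        # Provide the attribute that __reduce__ expects.
        return self.reason

    def __reduce__(self):
        return (FixedApiError, (self.status_code, self.message))


if __name__ == "__main__":
    try:
        pickle.dumps(BrokenApiError(400, "cannot upgrade"))
    except AttributeError as err:
        print("pickling failed:", err)

    restored = pickle.loads(pickle.dumps(FixedApiError(400, "cannot upgrade")))
    print(type(restored).__name__, restored.status_code, restored.message)
```

A round trip through `pickle.dumps` and `pickle.loads` like the one above is also one plausible shape for the unit tests the commit mentions, although the actual tests may be structured differently.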

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.debian stx.nfv