Patching manifest hung on system with mgmt LAG configuration

Bug #1798093 reported by Ghada Khalil
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Don Penney

Bug Description

Brief Description
-----------------
sw-patch-controller.service hung on an rsync that was launched and connected to “controller” just 4 seconds before networking brought up the bonded VLAN management interface. Bringing up the interface effectively killed the connection but left it in the ESTABLISHED state.

This issue was only observed once on one system with a mgmt network LAG configuration.
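
The failure mode described above (a connection that still looks established but never delivers data) can be illustrated with a minimal Python sketch. It is not the patching code and does not involve rsync; the `read_with_deadline` helper and the local socket pair are assumptions made purely for illustration. It shows that a blocking read on a silent peer only returns if the reader sets its own timeout.

```python
import socket
from typing import Optional


def read_with_deadline(sock: socket.socket, seconds: float) -> Optional[bytes]:
    """Return data from sock, or None if nothing arrives within `seconds`."""
    # Without a timeout, recv() on a silent connection blocks indefinitely.
    sock.settimeout(seconds)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None


# A connected local pair stands in for the rsync connection (hypothetical
# stand-in): the peer stays open, so the connection still looks alive,
# but it never sends anything.
local, silent_peer = socket.socketpair()
result = read_with_deadline(local, 0.2)
print(result)  # None
local.close()
silent_peer.close()
```

Whether a timeout of this kind is appropriate for the patching services is a separate design question; the sketch only shows why the hang does not clear on its own.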

Severity
--------
Major, but hard to reproduce

Steps to Reproduce
------------------
When the issue was encountered, the user was applying a patch and attempting to unlock the controller.
The patch operation hung. However, it's not likely that this can be readily reproduced.

Expected Behavior
-----------------
The patching operation passes and the unlock succeeds.

Actual Behavior
---------------
The patching operation hung.

Reproducibility
---------------
Highly intermittent

System Configuration
--------------------
Multi-node system - baremetal

Branch/Pull Time/Commit
-----------------------
Any StarlingX load

Timestamp/Logs
--------------
System seems to hang at the following stage of patching.pp.log:
2018-10-01T20:24:32.305 Debug: 2018-10-01 20:24:32 +0000 /Stage[main]/Patching::Api/Patching_config[keystone_authtoken/auth_admin_prefix]: Nothing to manage: no ensure and the resource doesn't exist
2018-10-01T20:24:32.307 Debug: 2018-10-01 20:24:32 +0000 /Stage[main]/Patching::Api/Patching_config[keystone_authtoken/auth_version]: Nothing to manage: no ensure and the resource doesn't exist
2018-10-01T20:24:32.318 Debug: 2018-10-01 20:24:32 +0000 Executing '/usr/bin/systemctl is-active sw-patch-agent.service'
2018-10-01T20:24:32.325 Debug: 2018-10-01 20:24:32 +0000 Executing '/usr/bin/systemctl is-enabled sw-patch-agent.service'
2018-10-01T20:24:32.332 Debug: 2018-10-01 20:24:32 +0000 Executing '/usr/bin/systemctl start sw-patch-agent.service'

Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 as this issue is highly intermittent (only seen once).

Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.update
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-update (master)

Fix proposed to branch: master
Review: https://review.openstack.org/642525

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-update (master)

Reviewed: https://review.openstack.org/642525
Committed: https://git.openstack.org/cgit/openstack/stx-update/commit/?id=f4f70432592b9618e6966ebd357775b580400880
Submitter: Zuul
Branch: master

commit f4f70432592b9618e6966ebd357775b580400880
Author: Don Penney <email address hidden>
Date: Thu Mar 7 10:02:57 2019 -0500

    Change service dependencies to network-online.target

    Use network-online.target rather than network.target for
    patching services dependencies. This is to ensure that
    all interfaces, such as LAG for example, are up before
    the patch services attempt any communications to ensure
    there are no unrecoverable disruptions.

    Change-Id: I7a20e6faaa6c9fffee0636cc5cb474b98dc88253
    Closes-Bug: 1798093
    Signed-off-by: Don Penney <email address hidden>
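
One way to check that a unit file carries the ordering this commit describes is sketched below in Python. The `orders_after` helper and the example unit text are hypothetical and are not copied from the stx-update repository; the sketch assumes the dependency is expressed through an `After=` line in the `[Unit]` section, which the commit message does not spell out.

```python
import configparser


def orders_after(unit_text: str, target: str = "network-online.target") -> bool:
    """Return True if the [Unit] section's After= line lists `target`."""
    # strict=False tolerates repeated keys; interpolation=None leaves
    # systemd's % specifiers untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    after = parser.get("Unit", "After", fallback="")
    return target in after.split()


# Hypothetical unit text for illustration only.
EXAMPLE_UNIT = """\
[Unit]
Description=Example patching service (illustrative only)
After=syslog.target network-online.target

[Service]
ExecStart=/bin/true
"""

print(orders_after(EXAMPLE_UNIT))                    # True
print(orders_after(EXAMPLE_UNIT, "network.target"))  # False
```

Note that this parser keeps only the last `After=` line when a unit repeats the key, whereas systemd accumulates them, so the check is a rough aid rather than a full unit-file parser.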

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05