AIODX (IPv6): Stand-by controller is in boot loop after unlock

Bug #1842949 reported by Tee Ngo
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Bin Qian

Bug Description

Brief Description
-----------------
After the standby controller of an AIODX was unlocked, sm failed to start resulting in a boot loop.

Severity
--------
Critical as multiple system tests including host lock/unlock is blocked.

Steps to Reproduce
------------------
This issue is observed in an AIODX system configured for IPv6 with various types of resource consumer pods. Before launching some SRIOV pods, the following steps were executed:

system host-label-assign controller-1 sriovdp=enabled
system host-lock controller-1
system host-if-modify -m 1500 -n sriov0 -c pci-sriov -N 4 --vf-driver=netdevice controller-1 enp26s0f1
system interface-datanetwork-assign controller-1 sriov0 group0-data0
system host-unlock controller-1

After the unlock, sm failed to bind socket to host_cluster interface with "No such device" reason though the interface existed and was up.

019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<43> INFO: sm[119876]: sm_process.c(595): sm_process_priority value is set to -2
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<44> INFO: sm[119876]: sm_process.c(752): Started.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.529] log<45> INFO: sm[119876]: sm_msg.c(329): Message instance (4d3c7b10-9d0a-463e-9e8b-59a924e2c250) for node (controller-0) set.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.530] log<46> INFO: sm[119876]: sm_service_domain_neighbor_fsm.c(732): Neighbor (controller-0) received event (hello-message) was in the down state and is now in the exchange-start.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.531] log<47> INFO: sm[119876]: sm_service_domain_utils.c(273): Donot send exchange start too often (100 ms)
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<3> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (enp24s0f0.163), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<4> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(894): Failed to send alive on to ff05::3c:1 interface (enp24s0f0.163), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<5> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (eno1), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<6> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(915): Failed to send alive on interface (eno1), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.710] log<7> INFO: sm_heartbeat[119883]: sm_heartbeat_thread.c(1046): Peer node (controller-0) added.

….
….
….
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<73> ERROR: sm[119876]: sm_service_domain_interface_api.c(186): Event (node-enabled) not handled for service domain (controller) interface (cluster-host-interface), error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<74> ERROR: sm[119876]: sm_selobj.c(213): Select failed, error=Bad file descriptor.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<75> ERROR: sm[119876]: sm_process.c(762): Selection object dispatch failed, error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<76> INFO: sm[119876]: sm_process.c(857): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<3> INFO: sm_ta[119885]: sm_task_affining_thread.c(145): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<4> INFO: sm_ta[119885]: sm_task_affining_thread.c(154): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<3> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(203): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<4> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(212): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.812] log<8> ERROR: sm_heartbeat[119883]: sm_util_types.c(16): Critical error, failed to obtain the mutex. Quit...
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<3> INFO: sm_trap[119877]: sm_trap_thread.c(509): Trap connection has been broken, exiting.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<4> INFO: sm_trap[119877]: sm_trap_thread.c(672): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<5> INFO: sm_trap[119877]: sm_trap_thread.c(681): Shutdown complete.

Expected Behavior
------------------
Successful unlock

Actual Behavior
----------------
The standby controller got into a boot loop.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIODX, IPv6

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190821T053000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="221"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-21 05:30:00 +0000"

Last Pass
---------
Not aware of the last time a lock/unlock was conducted in an IPv6 AIODX system.

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or paste.openstack.org)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
System Test

Tags: stx.3.0 stx.ha
Revision history for this message
Tee Ngo (teewrs) wrote :
summary: - Stand-by controller is in boot loop after unlock
+ AIODX (IPv6): Stand-by controller is in boot loop after unlock
Bin Qian (bqian20)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/680492

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.ha
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority - issue bringing up a duplex system w/ IPv6

Changed in starlingx:
importance: Undecided → High
tags: added: stx.3.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/680492
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=7f52df37bd57916406ecac0f7fe6064129f3a38b
Submitter: Zuul
Branch: master

commit 7f52df37bd57916406ecac0f7fe6064129f3a38b
Author: Bin Qian <email address hidden>
Date: Thu Sep 5 13:33:52 2019 -0400

    Fix IPv6 standby controller boot loop

    IPv6 multicast should be sent to the interface that the socket
    binds to.

    Closes-Bug: 1842949
    Change-Id: I14b6c5193c67a0ddd69e31d1044219c4e9fd6b94
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.