Brief Description
-----------------
After the standby controller of an AIODX system was unlocked, sm failed to start, resulting in a boot loop.
Severity
--------
Critical: multiple system tests, including host lock/unlock, are blocked.
Steps to Reproduce
------------------
The issue was observed on an AIODX system configured for IPv6 and running various types of resource-consumer pods. Before launching the SR-IOV pods, the following steps were executed:
system host-label-assign controller-1 sriovdp=enabled
system host-lock controller-1
system host-if-modify -m 1500 -n sriov0 -c pci-sriov -N 4 --vf-driver=netdevice controller-1 enp26s0f1
system interface-datanetwork-assign controller-1 sriov0 group0-data0
system host-unlock controller-1
After the unlock, sm failed to bind its socket to the cluster-host interface, reporting "No such device", even though the interface existed and was up. A minimal sketch after the log excerpt below illustrates this failure mode.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<43> INFO: sm[119876]: sm_process.c(595): sm_process_priority value is set to -2
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<44> INFO: sm[119876]: sm_process.c(752): Started.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.529] log<45> INFO: sm[119876]: sm_msg.c(329): Message instance (4d3c7b10-9d0a-463e-9e8b-59a924e2c250) for node (controller-0) set.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.530] log<46> INFO: sm[119876]: sm_service_domain_neighbor_fsm.c(732): Neighbor (controller-0) received event (hello-message) was in the down state and is now in the exchange-start.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.531] log<47> INFO: sm[119876]: sm_service_domain_utils.c(273): Donot send exchange start too often (100 ms)
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<3> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (enp24s0f0.163), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<4> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(894): Failed to send alive on to ff05::3c:1 interface (enp24s0f0.163), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<5> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (eno1), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<6> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(915): Failed to send alive on interface (eno1), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.710] log<7> INFO: sm_heartbeat[119883]: sm_heartbeat_thread.c(1046): Peer node (controller-0) added.
…
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<73> ERROR: sm[119876]: sm_service_domain_interface_api.c(186): Event (node-enabled) not handled for service domain (controller) interface (cluster-host-interface), error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<74> ERROR: sm[119876]: sm_selobj.c(213): Select failed, error=Bad file descriptor.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<75> ERROR: sm[119876]: sm_process.c(762): Selection object dispatch failed, error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<76> INFO: sm[119876]: sm_process.c(857): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<3> INFO: sm_ta[119885]: sm_task_affining_thread.c(145): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<4> INFO: sm_ta[119885]: sm_task_affining_thread.c(154): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<3> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(203): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<4> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(212): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.812] log<8> ERROR: sm_heartbeat[119883]: sm_util_types.c(16): Critical error, failed to obtain the mutex. Quit...
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<3> INFO: sm_trap[119877]: sm_trap_thread.c(509): Trap connection has been broken, exiting.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<4> INFO: sm_trap[119877]: sm_trap_thread.c(672): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<5> INFO: sm_trap[119877]: sm_trap_thread.c(681): Shutdown complete.
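The "Bad file descriptor" errors above are fallout from the initial bind failure: once the socket for the cluster-host interface cannot be set up, every subsequent send and select on it fails, and sm shuts down. As a rough illustration only (this is not sm's actual code), the sketch below opens an IPv6 UDP socket and joins the heartbeat multicast group seen in the logs (ff05::3c:1) on the cluster-host VLAN interface (enp24s0f0.163). Both the interface-index lookup and the group join can report "No such device" (ENODEV) if they race with interface (re)provisioning during the unlock.

/*
 * Minimal sketch, not sm code: join the sm heartbeat multicast group
 * on an interface-scoped IPv6 socket. The interface name and group
 * address are taken from the log excerpt above; everything else is
 * illustrative.
 */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    const char *ifname = "enp24s0f0.163";  /* cluster-host interface in the logs */
    unsigned int ifindex = if_nametoindex(ifname);
    if (ifindex == 0) {
        perror("if_nametoindex");          /* ENODEV if the interface is not (yet) there */
        return 1;
    }

    int sock = socket(AF_INET6, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    /* Join the heartbeat multicast group seen in the logs. */
    struct ipv6_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET6, "ff05::3c:1", &mreq.ipv6mr_multiaddr);
    mreq.ipv6mr_interface = ifindex;
    if (setsockopt(sock, IPPROTO_IPV6, IPV6_JOIN_GROUP, &mreq, sizeof(mreq)) < 0) {
        perror("IPV6_JOIN_GROUP");         /* can also fail with ENODEV */
        close(sock);
        return 1;
    }

    printf("joined ff05::3c:1 on %s (ifindex %u)\n", ifname, ifindex);
    close(sock);
    return 0;
}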
Expected Behavior
------------------
Successful unlock; sm starts and the standby controller comes up normally.
Actual Behavior
----------------
The standby controller entered a boot loop because sm repeatedly failed to start.
Reproducibility
---------------
100% reproducible
System Configuration
--------------------
AIODX, IPv6
Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190821T053000Z"
JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="221"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-21 05:30:00 +0000"
Last Pass
---------
Not aware of the last time a lock/unlock was conducted in an IPv6 AIODX system.
Timestamp/Logs
--------------
The failure starts at 2019-09-04T21:29:01 on controller-1; see the sm log excerpt in Steps to Reproduce. Useful markers when searching the logs: "Failed to send message on socket for interface", "Select failed, error=Bad file descriptor", and sm PID 119876.
Test Activity
-------------
System Test
Fix proposed to branch: master
Review: https://review.opendev.org/680492
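For context only, and independent of what the review above actually changes: one generic way to tolerate an interface that appears late during an unlock is to retry the interface-index lookup with a bounded timeout instead of failing on the first ENODEV. A hypothetical sketch (wait_for_ifindex and its timeout are made up for illustration):

#include <net/if.h>
#include <stdio.h>
#include <unistd.h>

/* Wait up to max_wait_secs for an interface to become resolvable. */
static unsigned int wait_for_ifindex(const char *ifname, int max_wait_secs)
{
    for (int waited = 0; waited < max_wait_secs; ++waited) {
        unsigned int ifindex = if_nametoindex(ifname);
        if (ifindex != 0)
            return ifindex;              /* interface is present */
        sleep(1);                        /* not there yet; retry */
    }
    return 0;                            /* caller decides how to fail */
}

int main(void)
{
    unsigned int idx = wait_for_ifindex("enp24s0f0.163", 10);
    if (idx == 0)
        fprintf(stderr, "interface never appeared\n");
    else
        printf("ifindex=%u\n", idx);
    return 0;
}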