StarlingX

AIODX (IPv6): Stand-by controller is in boot loop after unlock

Bug #1842949 reported by Tee Ngo on 2019-09-05

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Bin Qian

Bug Description

Brief Description
-----------------
After the standby controller of an AIODX was unlocked, sm failed to start resulting in a boot loop.

Severity
--------
Critical as multiple system tests including host lock/unlock is blocked.

Steps to Reproduce
------------------
This issue is observed in an AIODX system configured for IPv6 with various types of resource consumer pods. Before launching some SRIOV pods, the following steps were executed:

system host-label-assign controller-1 sriovdp=enabled
system host-lock controller-1
system host-if-modify -m 1500 -n sriov0 -c pci-sriov -N 4 --vf-driver=netdevice controller-1 enp26s0f1
system interface-datanetwork-assign controller-1 sriov0 group0-data0
system host-unlock controller-1

After the unlock, sm failed to bind socket to host_cluster interface with "No such device" reason though the interface existed and was up.

019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<43> INFO: sm[119876]: sm_process.c(595): sm_process_priority value is set to -2
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.494] log<44> INFO: sm[119876]: sm_process.c(752): Started.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.529] log<45> INFO: sm[119876]: sm_msg.c(329): Message instance (4d3c7b10-9d0a-463e-9e8b-59a924e2c250) for node (controller-0) set.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.530] log<46> INFO: sm[119876]: sm_service_domain_neighbor_fsm.c(732): Neighbor (controller-0) received event (hello-message) was in the down state and is now in the exchange-start.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.531] log<47> INFO: sm[119876]: sm_service_domain_utils.c(273): Donot send exchange start too often (100 ms)
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<3> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (enp24s0f0.163), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<4> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(894): Failed to send alive on to ff05::3c:1 interface (enp24s0f0.163), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<5> ERROR: sm_heartbeat[119883]: sm_heartbeat_msg.c(259): Failed to send message on socket for interface (eno1), error=Bad file descriptor.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.551] log<6> ERROR: sm_heartbeat[119883]: sm_heartbeat_thread.c(915): Failed to send alive on interface (eno1), error=FAILED.
2019-09-04T21:29:01.000 controller-1 sm: debug time[431.710] log<7> INFO: sm_heartbeat[119883]: sm_heartbeat_thread.c(1046): Peer node (controller-0) added.

….
….
….
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<73> ERROR: sm[119876]: sm_service_domain_interface_api.c(186): Event (node-enabled) not handled for service domain (controller) interface (cluster-host-interface), error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<74> ERROR: sm[119876]: sm_selobj.c(213): Select failed, error=Bad file descriptor.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<75> ERROR: sm[119876]: sm_process.c(762): Selection object dispatch failed, error=FAILED.
2019-09-04T21:29:02.000 controller-1 sm: debug time[432.497] log<76> INFO: sm[119876]: sm_process.c(857): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<3> INFO: sm_ta[119885]: sm_task_affining_thread.c(145): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[432.994] log<4> INFO: sm_ta[119885]: sm_task_affining_thread.c(154): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<3> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(203): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.289] log<4> INFO: sm_service_hb[119884]: sm_service_heartbeat_thread.c(212): Shutdown complete.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.812] log<8> ERROR: sm_heartbeat[119883]: sm_util_types.c(16): Critical error, failed to obtain the mutex. Quit...
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<3> INFO: sm_trap[119877]: sm_trap_thread.c(509): Trap connection has been broken, exiting.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<4> INFO: sm_trap[119877]: sm_trap_thread.c(672): Shutting down.
2019-09-04T21:29:03.000 controller-1 sm: debug time[433.813] log<5> INFO: sm_trap[119877]: sm_trap_thread.c(681): Shutdown complete.

Expected Behavior
------------------
Successful unlock

Actual Behavior
----------------
The standby controller got into a boot loop.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIODX, IPv6

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190821T053000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="221"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-21 05:30:00 +0000"

Last Pass
---------
Not aware of the last time a lock/unlock was conducted in an IPv6 AIODX system.

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or paste.openstack.org)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
System Test

Tags:

Revision history for this message

Tee Ngo (teewrs) wrote on 2019-09-05:

unlock-sm-failure-logs.tgz Edit (1.2 MiB, application/x-tar)

summary:

- Stand-by controller is in boot loop after unlock
+ AIODX (IPv6): Stand-by controller is in boot loop after unlock

Bin Qian (bqian20) on 2019-09-05

Changed in starlingx:
assignee:	nobody → Bin Qian (bqian20)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-05: Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/680492

Changed in starlingx:
status:	New → In Progress

Ghada Khalil (gkhalil) on 2019-09-06

tags:

added: stx.ha

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-09-06:

Marking as stx.3.0 / high priority - issue bringing up a duplex system w/ IPv6

Changed in starlingx:
importance:	Undecided → High
tags:	added: stx.3.0

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-10: Fix merged to ha (master)

Reviewed: https://review.opendev.org/680492
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=7f52df37bd57916406ecac0f7fe6064129f3a38b
Submitter: Zuul
Branch: master

commit 7f52df37bd57916406ecac0f7fe6064129f3a38b
Author: Bin Qian <email address hidden>
Date: Thu Sep 5 13:33:52 2019 -0400

Fix IPv6 standby controller boot loop

IPv6 multicast should be sent to the interface that the socket
binds to.

    Closes-Bug: 1842949
    Change-Id: I14b6c5193c67a0ddd69e31d1044219c4e9fd6b94
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

unlock-sm-failure-logs.tgz Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.