StarlingX

Activity log for bug #1845393

Date	Who	What changed	Old value	New value	Message
2019-09-25 21:38:16	Tee Ngo	bug			added bug
2019-09-25 21:38:16	Tee Ngo	attachment added		failover_issue.tgz https://bugs.launchpad.net/bugs/1845393/+attachment/5291303/+files/failover_issue.tgz
2019-09-25 21:39:57	Tee Ngo	description	Brief Description ----------------- After a switch restart, the controllers is unable to establish active-standby relationship. When the switch got restarted; both controllers experienced heartbeat loss on both mgmt and cluster-host interfaces. Before making the fail-over decision; these interfaces went down but the peer update had not been received so both controllers think its peer is more healthy. Severity -------- Critical Steps to Reproduce ------------------ While the system is up with all nodes unlocked-enabled-availabled, restart the switch. Expected Behavior ------------------ Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP. Actual Behavior ---------------- Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible. Reproducibility --------------- Only tried restarting the switch once an in attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP that might have caused Multinode Failure Avoidance in the lab. System Configuration -------------------- IPv6 Standard system Branch/Pull Time/Commit ----------------------- SW_VERSION="19.10" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="2019-09-17_20-00-00" SRC_BUILD_ID="15" Last Pass --------- N/A Timestamp/Logs -------------- See sm.log attached Test Activity ------------- System Test	Brief Description ----------------- After a switch restart, the controllers are unable to establish active-standby relationship. When the switch got restarted; both controllers experienced heartbeat loss on both mgmt and cluster-host interfaces. Before making the fail-over decision; these interfaces went down but the peer update had not been received so both controllers think its peer is more healthy. 
Severity -------- Critical Steps to Reproduce ------------------ While the system is up with all nodes unlocked-enabled-availabled, restart the switch. Expected Behavior ------------------ Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP. Actual Behavior ---------------- Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible. Reproducibility --------------- Only tried restarting the switch once an in attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP that might have caused Multinode Failure Avoidance in the lab. System Configuration -------------------- IPv6 Standard system Branch/Pull Time/Commit ----------------------- SW_VERSION="19.10" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="2019-09-17_20-00-00" SRC_BUILD_ID="15" Last Pass --------- N/A Timestamp/Logs -------------- See sm.log attached Test Activity ------------- System Test
2019-09-26 15:33:17	Frank Miller	starlingx: assignee		Bin Qian (bqian20)
2019-09-26 16:12:30	Ghada Khalil	tags		stx.ha
2019-09-26 16:13:32	Ghada Khalil	description	Brief Description ----------------- After a switch restart, the controllers are unable to establish active-standby relationship. When the switch got restarted; both controllers experienced heartbeat loss on both mgmt and cluster-host interfaces. Before making the fail-over decision; these interfaces went down but the peer update had not been received so both controllers think its peer is more healthy. Severity -------- Critical Steps to Reproduce ------------------ While the system is up with all nodes unlocked-enabled-availabled, restart the switch. Expected Behavior ------------------ Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP. Actual Behavior ---------------- Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible. Reproducibility --------------- Only tried restarting the switch once an in attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP that might have caused Multinode Failure Avoidance in the lab. System Configuration -------------------- IPv6 Standard system Branch/Pull Time/Commit ----------------------- SW_VERSION="19.10" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="2019-09-17_20-00-00" SRC_BUILD_ID="15" Last Pass --------- N/A Timestamp/Logs -------------- See sm.log attached Test Activity ------------- System Test	Brief Description ----------------- After a switch restart, the controllers are unable to establish active-standby relationship. When the switch got restarted; both controllers experienced heartbeat loss on both mgmt and cluster-host interfaces. Before making the fail-over decision; these interfaces went down but the peer update had not been received so both controllers think its peer is more healthy. 
Severity -------- Critical Steps to Reproduce ------------------ While the system is up with all nodes unlocked-enabled-availabled, restart the switch. Expected Behavior ------------------ Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP. Actual Behavior ---------------- Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible. Reproducibility --------------- Only tried restarting the switch once an in attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP that might have caused Multinode Failure Avoidance in the lab. System Configuration -------------------- IPv6 Standard system Branch/Pull Time/Commit ----------------------- master BUILD_ID="2019-09-17_20-00-00" Last Pass --------- N/A Timestamp/Logs -------------- See sm.log attached Test Activity ------------- System Test
2019-09-27 18:37:10	Ghada Khalil	bug			added subscriber Bill Zvonar
2019-09-27 18:37:17	Ghada Khalil	starlingx: importance	Undecided	Medium
2019-09-27 18:37:20	Ghada Khalil	starlingx: status	New	Triaged
2019-09-27 18:37:27	Ghada Khalil	tags	stx.ha	stx.3.0 stx.ha
2019-09-30 14:56:38	Bin Qian	starlingx: status	Triaged	In Progress
2019-10-08 18:27:52	OpenStack Infra	starlingx: status	In Progress	Fix Released