DC: Worker node went for second reboot after all nodes power down/up

Bug #1884335 reported by Nimalini Rasa

Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Nimalini Rasa

Bug Description

Brief Description
-----------------
The worker node went for a second reboot after a power down/power up of all nodes on an AIO-DX+ system with one worker.

Error : worker-1 {"state-change": {"administrative":"unlocked","operational":"disabled","availability":"failed","subfunction_oper":"disabled","subfunction_avail":"not-installed"},"hostname":"worker-1","uuid":"4636d058-59a8-4cef-8e4c-c793471772b1","subfunctions":"worker","personality":"worker"}

Severity
--------
Major

Steps to Reproduce
------------------
Power down all nodes, then power up all nodes.

Expected Behavior
------------------
The worker node comes up without a second reboot.

Actual Behavior
----------------
The worker node failed to come up and went for a second reboot.

Reproducibility
---------------
Seen 2/2 times

System Configuration
--------------------
DC system controllers: two-node system with one worker
Subclouds: one-node systems
IPv6

Branch/Pull Time/Commit
-----------------------
2020-06-11_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
DOR all nodes: Fri Jun 19 21:02:04 EDT 2020 (off)
Fri Jun 19 21:02:51 EDT 2020 (On)
2020-06-20T01:10:24.582 [29529.00110] controller-1 mtcAgent mtc mtcNodeCtrl.cpp (1540) daemon_service_run : Info : controller-1 is ACTIVE ; DOR Recovery 5:28 mins ( 328 secs) (duration 600 secs)
2020-06-20T01:10:55.531 [29529.00351] controller-1 mtcAgent |-| nodeClass.cpp (7576) report_dor_recovery : Info : controller-0 is ENABLED ; DOR Recovery 5:59 mins ( 359 secs) (uptime: 5:53 mins)
2020-06-20T01:26:37.066 [29529.00382] controller-1 mtcAgent |-| nodeClass.cpp (7576) report_dor_recovery : Info : worker-1 is ENABLED ; DOR Recovery 21:40 mins (1300 secs) (uptime: 2:07 mins)

Test Activity
-------------
System Test

Workaround
----------
N/A

Revision history for this message
Nimalini Rasa (nrasa) wrote:
Nimalini Rasa (nrasa)
description: updated
Revision history for this message
Yang Liu (yliu12) wrote:
Revision history for this message
Brent Rowsell (brent-rowsell) wrote:

It appears there were issues accessing the DNS server on the controller, and worker_config eventually timed out.

2020-06-20T01:08:43.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:03.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:25.000 worker-1 NFSCHECK: notice /opt/platform is not mounted
2020-06-20T01:09:43.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:03.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:25.000 worker-1 NFSCHECK: notice /opt/platform is not mounted
2020-06-20T01:18:48.000 worker-1 /etc/init.d/worker_config: warning DNS query failed after max retries for worker-1 (620 secs)
2020-06-20T01:18:48.000 worker-1 root: notice Error: Unable to get IP from host: worker-1

It appears the active controller's (controller-1) services had recovered at 01:10.

The local interface was brought up here:

2020-06-20T01:05:21.134 worker-1 kernel: info [ 17.210897] ip6_tables: (C) 2000-2006 Netfilter Core Team
2020-06-20T01:05:21.439 worker-1 kernel: info [ 17.515671] IPv6: ADDRCONF(NETDEV_UP): enp24s0f0: link is not ready
2020-06-20T01:05:21.439 worker-1 kernel: info [ 17.515680] IPv6: ADDRCONF(NETDEV_CHANGE): enp24s0f0: link becomes ready
2020-06-20T01:05:21.520 worker-1 kernel: info [ 17.596341] 8021q: 802.1Q VLAN Support v1.8
2020-06-20T01:05:21.520 worker-1 kernel: info [ 17.596347] 8021q: adding VLAN 0 to HW filter on device enp24s0f0

Revision history for this message
Ghada Khalil (gkhalil) wrote:

stx.5.0 / medium priority - the node still recovers, but needs an extra reboot. Should be investigated, but will not hold stx.4.0.

summary: - DC:Worker node went for second reboot after all nodes power down/up
+ DC: Worker node went for second reboot after all nodes power down/up
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0 stx.config
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote:

@Nimalini Rasa, is this issue reproducible?

Revision history for this message
Nimalini Rasa (nrasa) wrote:

I have not re-run the test to see if it reproduces.

Revision history for this message
Bart Wensley (bartwensley) wrote:

The logs in Brent's comment above seem to indicate an 8-minute delay in the worker_config script before it failed:
2020-06-20T01:10:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:18:48.000 worker-1 /etc/init.d/worker_config: warning DNS query failed after max retries for worker-1 (620 secs)

This could indicate that the "dig" command hung, which seems to be a common issue. If so, we could improve robustness by wrapping the "dig" command with the "timeout" command to ensure a hung query is killed rather than blocking indefinitely.

However, the log files at https://files.starlingx.kube.cengn.ca/launchpad/1884335 are no longer available, so I can't confirm whether there were other logs in between the above two logs.

Setting the LP to Incomplete and assigning it to the originator to see if the logs might still be available. If we don't have the logs, we'll have to close this one until it can be reproduced and logs collected.

Changed in starlingx:
status: Triaged → Incomplete
assignee: nobody → Nimalini Rasa (nrasa)
Revision history for this message
Ghada Khalil (gkhalil) wrote:

As discussed with Nimalini, closing since the logs are not available and the investigation can't move forward. This LP is ~9 months old. A new LP with a fresh set of logs should be opened if the issue is reproduced with a more recent image.

Changed in starlingx:
status: Incomplete → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded