DC: Worker node went for second reboot after all nodes power down/up

Bug #1884335 reported by Nimalini Rasa

Affects: StarlingX
Status: Won't Fix
Importance: Medium
Assigned to: Nimalini Rasa

Bug Description

Brief Description
-----------------
The worker node went for a second reboot after a power down/power up of all nodes on an AIO-DX+ system with one worker.

Error : worker-1 {"state-change": {"administrative":"unlocked","operational":"disabled","availability":"failed","subfunction_oper":"disabled","subfunction_avail":"not-installed"},"hostname":"worker-1","uuid":"4636d058-59a8-4cef-8e4c-c793471772b1","subfunctions":"worker","personality":"worker"}

Severity
--------
Major

Steps to Reproduce
------------------
Power down all nodes, then power up all nodes.

Expected Behavior
------------------
The worker node comes up without a second reboot.

Actual Behavior
----------------
The worker node failed to come up and went for a second reboot.

Reproducibility
---------------
Seen 2/2 times

System Configuration
--------------------
DC system controllers: two-node system with one worker
Subclouds: one-node systems
IPv6

Branch/Pull Time/Commit
-----------------------
2020-06-11_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
DOR all nodes: Fri Jun 19 21:02:04 EDT 2020 (off)
Fri Jun 19 21:02:51 EDT 2020 (On)
2020-06-20T01:10:24.582 [29529.00110] controller-1 mtcAgent mtc mtcNodeCtrl.cpp (1540) daemon_service_run : Info : controller-1 is ACTIVE ; DOR Recovery 5:28 mins ( 328 secs) (duration 600 secs)
2020-06-20T01:10:55.531 [29529.00351] controller-1 mtcAgent |-| nodeClass.cpp (7576) report_dor_recovery : Info : controller-0 is ENABLED ; DOR Recovery 5:59 mins ( 359 secs) (uptime: 5:53 mins)
2020-06-20T01:26:37.066 [29529.00382] controller-1 mtcAgent |-| nodeClass.cpp (7576) report_dor_recovery : Info : worker-1 is ENABLED ; DOR Recovery 21:40 mins (1300 secs) (uptime: 2:07 mins)

Test Activity
-------------
System Test

Workaround
----------
N/A

Revision history for this message
Nimalini Rasa (nrasa) wrote:
Nimalini Rasa (nrasa)
description: updated
Revision history for this message
Yang Liu (yliu12) wrote:
Revision history for this message
Brent Rowsell (brent-rowsell) wrote:

It appears there were issues accessing the DNS server on the controller, and worker_config eventually timed out.

2020-06-20T01:08:43.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:03.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:09:25.000 worker-1 NFSCHECK: notice /opt/platform is not mounted
2020-06-20T01:09:43.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:03.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:10:25.000 worker-1 NFSCHECK: notice /opt/platform is not mounted
2020-06-20T01:18:48.000 worker-1 /etc/init.d/worker_config: warning DNS query failed after max retries for worker-1 (620 secs)
2020-06-20T01:18:48.000 worker-1 root: notice Error: Unable to get IP from host: worker-1

It appears the active controller's (controller-1) services had recovered at 01:10.

The local interface was brought up here:

2020-06-20T01:05:21.134 worker-1 kernel: info [ 17.210897] ip6_tables: (C) 2000-2006 Netfilter Core Team
2020-06-20T01:05:21.439 worker-1 kernel: info [ 17.515671] IPv6: ADDRCONF(NETDEV_UP): enp24s0f0: link is not ready
2020-06-20T01:05:21.439 worker-1 kernel: info [ 17.515680] IPv6: ADDRCONF(NETDEV_CHANGE): enp24s0f0: link becomes ready
2020-06-20T01:05:21.520 worker-1 kernel: info [ 17.596341] 8021q: 802.1Q VLAN Support v1.8
2020-06-20T01:05:21.520 worker-1 kernel: info [ 17.596347] 8021q: adding VLAN 0 to HW filter on device enp24s0f0

Revision history for this message
Ghada Khalil (gkhalil) wrote:

stx.5.0 / medium priority - the node still recovers, but needs an extra reboot. Should be investigated, but will not hold stx.4.0.

summary: - DC:Worker node went for second reboot after all nodes power down/up
+ DC: Worker node went for second reboot after all nodes power down/up
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0 stx.config
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote:

@Nimalini Rasa, is this issue reproducible?

Revision history for this message
Nimalini Rasa (nrasa) wrote:

I have not re-run the test to see if it reproduces.

Revision history for this message
Bart Wensley (bartwensley) wrote:

The logs in Brent's comment above seem to indicate an 8-minute delay in the worker_config script before it failed:
2020-06-20T01:10:23.000 worker-1 /etc/init.d/worker_config: warning DNS query failed for worker-1
2020-06-20T01:18:48.000 worker-1 /etc/init.d/worker_config: warning DNS query failed after max retries for worker-1 (620 secs)

This could indicate that the "dig" command hung, which seems to be a common issue. If so, we could improve robustness by wrapping the "dig" command with the "timeout" command to ensure a hung query is killed rather than blocking indefinitely.

However, the log files at https://files.starlingx.kube.cengn.ca/launchpad/1884335 are no longer available, so I can't confirm whether there were other logs in between the above two logs.

Setting the LP to Incomplete and assigning it to the originator to see if the logs might still be available. If we don't have the logs, we'll have to close this one until it can be reproduced and logs collected.

Changed in starlingx:
status: Triaged → Incomplete
assignee: nobody → Nimalini Rasa (nrasa)
Revision history for this message
Ghada Khalil (gkhalil) wrote:

As discussed with Nimalini, closing since the logs are not available and the investigation can't move forward. This LP is ~9 months old. A new LP with a fresh set of logs should be opened if the issue is reproduced with a more recent image.

Changed in starlingx:
status: Incomplete → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded