AIO - Simplex reboots during application-apply due to ovs-dpdk error

Bug #1837936 reported by Cristopher Lemus
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | High | Austin Sun | |

Bug Description

Brief Description
-----------------
On a new install of StarlingX, an AIO Simplex bare metal system reboots during the first application apply.

Severity
--------
Major: Breaks automated setup. However, the system came back and, after a while, was unlocked/enabled/available. As a workaround, we checked the system manually and restarted the procedure from this point. Should this be Minor?

Steps to Reproduce
------------------
Follow the wiki procedure. The system reboots at this step: https://wiki.openstack.org/wiki/StarlingX/Containers/Installation#Bring_Up_Services once it starts to apply stx-openstack.

Expected Behavior
------------------
System should not reboot during application apply.

Actual Behavior
----------------
System reboots during application apply.

Reproducibility
---------------
Intermittent. On a second run, this did not happen.

System Configuration
--------------------
One Node (Simplex) - Baremetal

Branch/Pull Time/Commit
-----------------------
20190725T013000Z

Last Pass
---------
This issue was not observed with the CENGN ISO from 07/24.

Timestamp/Logs
--------------
With fm event-list, it looks like the system "experienced a configuration failure", then recovered:

| 2019-07-25T08:22:31.342663 | clear | 200.011 | controller-0 experienced a configuration failure. | host=controller-0 | critical |
| 2019-07-25T08:14:47.388679 | log | 401.001 | Service group controller-services state change from active to disabling on host controller-0 | service_domain=controller.service_group=controller-services.host=controller-0 | critical |
| 2019-07-25T08:14:38.298070 | set | 200.011 | controller-0 experienced a configuration failure. | host=controller-0 | critical |
| 2019-07-25T08:14: | log | 200.022 | controller-0 is now 'disabled' | host=controller-0.state=disabled | not-applicable |

Full collect is attached.

Test Activity
-------------
Sanity.

Cristopher Lemus (cjlemusc) wrote :

yong hu (yhu6) wrote :

How many GB of RAM does this bare metal server have?

We need to monitor this bare metal server for several days to tell whether this is a HW issue or a SW problem.

Brent Rowsell (brent-rowsell) wrote :

2019-07-25T08:14:38.246 [110919.00135] controller-0 mtcAgent hdl mtcSubfHdlrs.cpp ( 144) enable_subf_handler :Error : controller-0-worker configuration timeout (900 secs)
2019-07-25T08:14:38.246 [110919.00136] controller-0 mtcAgent hbs nodeClass.cpp (1626) alarm_config_failure :Error : controller-0 critical config failure
2019-07-25T08:14:38.246 [110919.00137] controller-0 mtcAgent alm mtcAlarm.cpp ( 417) mtcAlarm_critical :Error : controller-0 setting critical 'Configuration' failure alarm (200.011 )
2019-07-25T08:14:38.246 fmAPI.cpp(471): Enqueue raise alarm request: UUID (67111aab-581c-42fe-85ff-44cfa3f56f4e) alarm id (200.011) instant id (host=controller-0)
2019-07-25T08:14:38.246 [110919.00138] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-0 Task: Worker Configuration Timeout, re-enabling (seq:25)
2019-07-25T08:14:38.247 [110919.00139] controller-0 mtcAgent hbs nodeClass.cpp (7221) ar_manage : Warn : controller-0 auto recovery (try 1 of 2) (0)
2019-07-25T08:14:38.247 fmAPI.cpp(471): Enqueue raise alarm request: UUID (3fca5d7a-9094-477a-9313-e174775a49a5) alarm id (200.022) instant id (host=controller-0.state=disabled)
2019-07-25T08:14:38.247 [110919.00140] controller-0 mtcAgent inv mtcInvApi.cpp ( 987) mtcInvApi_update_states_now: Info : controller-0 unlocked-disabled-failed disabled-failed
2019-07-25T08:14:38.296 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.011), entity_id (host=controller-0)
2019-07-25T08:14:38.297 fmAlarmUtils.cpp(558): FM Response for raise alarm: (0), alarm_id (200.011), entity_id (host=controller-0)
2019-07-25T08:14:38.297 fmAlarmUtils.cpp(524): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=controller-0.state=disabled)
2019-07-25T08:14:38.338 fmAlarmUtils.cpp(558): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=controller-0.state=disabled)
2019-07-25T08:14:38.457 [110919.00141] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 625) send_mtc_cmd : Info : controller-0 sending 'lazy reboot' request (Mgmnt network)
2019-07-25T08:14:38.458 [296073.00142] controller-0 mtcAgent com nodeUtil.cpp (1029) fork_sysreq_reboot : Info : *** Failsafe Reset Thread ***
2019-07-25T08:14:38.458 [110919.00142] controller-0 mtcAgent com nodeUtil.cpp (1083) fork_sysreq_reboot : Info : Forked Fail-Safe (Backup) Reboot Action
2019-07-25T08:14:39.458 [296073.00143] controller-0 mtcAgent com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 120 seconds

2019-07-25T07:59:30.000 controller-0 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
2019-07-25T07:59:30.000 controller-0 ovs-vswitchd: err ovs|00019|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-07-25T07:59:30.000 controller-0 ovs-vswitchd: err ovs|00020|dpdk|ERR|EAL: Invalid 'command line' arguments.
2019-07-25T07:59:30.000 controller-0 ovs-vswitchd: alert ovs|00021|dpdk|EMER|Unable to initialize DPDK: Invalid argument
2019-07-25T07:59:30.000 controller-0 ovs-appctl: warning ovs|00001|unixctl|WARN|failed to...


Cristopher Lemus (cjlemusc) wrote :

Hi Yong,

This simplex server is the one that we use daily for sanity. This is the output of free -h:

controller-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            93G         80G        2.1G         57M         10G         10G
Swap:            0B          0B          0B
controller-0:~$

Cristopher Lemus (cjlemusc) wrote :

As an additional comment: today, with the latest build (20190726T013000Z), another AIO system, this time Duplex, failed during the install of the secondary controller. I created a new bug, https://bugs.launchpad.net/starlingx/+bug/1838031, but it might be a duplicate of this intermittent issue.

Ghada Khalil (gkhalil) wrote :

Assigning to Austin to determine if this is a duplicate of:
https://bugs.launchpad.net/starlingx/+bug/1829403
where there is an issue with huge page memory allocation, resulting in ovs-dpdk failing to start.

If not, we will re-assign this to the networking team.
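
A hugepage allocation problem of this kind can be spotted from the hugepage counters in /proc/meminfo. The following Python snippet is a minimal sketch of such a check and is not part of StarlingX; the function names are hypothetical, and treating HugePages_Total equal to zero as "not allocated" is an assumption made for illustration.

```python
def parse_hugepage_fields(meminfo_text):
    """Collect the hugepage counters from /proc/meminfo-style text.

    Returns a dict such as {"HugePages_Total": 8, "Hugepagesize": 2048}.
    The Hugepagesize value is in kB, as reported by the kernel.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        parts = rest.split()
        if sep and parts and (key.startswith("HugePages_") or key == "Hugepagesize"):
            fields[key] = int(parts[0])
    return fields


def hugepages_allocated(meminfo_text):
    """Assumption: hugepages count as allocated when HugePages_Total > 0."""
    return parse_hugepage_fields(meminfo_text).get("HugePages_Total", 0) > 0
```

On a live host, the input would be the contents of /proc/meminfo, for example `hugepages_allocated(open("/proc/meminfo").read())`.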

tags: added: stx.2.0 stx.config
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Austin Sun (sunausti)
summary: - AIO - Simplex reboots during application-apply
+ AIO - Simplex reboots during application-apply due to ovs-dpdk error
Austin Sun (sunausti) wrote :

Hi Ghada,

The collect was taken too late: by the time of the latest collect, the system was already back to normal and hugepages had been allocated successfully.

Before the reboot, DPDK failed because hugepages had not been allocated successfully:
2019-07-25T07:59:30.131Z|00016|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 5 --huge-dir /mnt/huge-1048576kB --socket-mem 0,0 --socket-limit 0,0.

After the reboot:
2019-07-25T08:22:18.538Z|00015|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 5 --huge-dir /mnt/huge-1048576kB --socket-mem 1024,1024 --socket-limit 1024,1024.

meminfo was also correct:
HugePages_Total: 33795
HugePages_Free: 33795
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

From the log, I cannot be 100% sure that this is the same issue as 1829403, but it most likely is.

I suggest marking this as a duplicate of 1829403.
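
The difference between the two EAL ARGS lines quoted in this comment can be checked mechanically. The Python snippet below is a minimal sketch, not an existing StarlingX or OVS tool: the helper names are hypothetical, and flagging a line as faulty when every --socket-mem value is zero is an assumption based only on the two log lines above.

```python
import re

# Matches the argument list that ovs-vswitchd logs when it initializes DPDK.
EAL_ARGS_RE = re.compile(r"EAL ARGS: (?P<args>.*)")


def socket_mem_from_eal_line(line):
    """Return the --socket-mem values (one per NUMA node) from an EAL ARGS
    log line, or None if the line carries no such option."""
    match = EAL_ARGS_RE.search(line)
    if match is None:
        return None
    tokens = match.group("args").split()
    for i, token in enumerate(tokens[:-1]):
        if token == "--socket-mem":
            # Drop a trailing period in case the value ends the log line.
            value = tokens[i + 1].rstrip(".")
            return [int(part) for part in value.split(",")]
    return None


def dpdk_has_no_memory(line):
    """Assumption: the line is suspicious when all --socket-mem values are 0."""
    values = socket_mem_from_eal_line(line)
    return values is not None and all(v == 0 for v in values)
```

Applied to the log lines above, the entry from 07:59:30 would be flagged and the entry from 08:22:18 would not.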

Austin Sun (sunausti) wrote :

Since the collect was taken late, I set the status to Incomplete, but this is most likely a duplicate of 1829403.

Changed in starlingx:
status: Triaged → Incomplete
Ghada Khalil (gkhalil) wrote :

Marking as a duplicate of https://bugs.launchpad.net/starlingx/+bug/1829403 as per recommendation from Austin above.

Ghada Khalil (gkhalil) wrote :

Duplicate bug is fixed by:
https://review.opendev.org/672634
Merged on 2019-07-29

Changed in starlingx:
status: Incomplete → Fix Released