StarlingX setup goes down every week

Bug #1901131 reported by Akshay
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Brief Description
-----------------
I have a StarlingX R4.0 bare-metal duplex deployment up and running.
Initially there were VMs running on the setup, and the setup went down (Horizon and the OpenStack CLI became inaccessible) after 5-6 days. We tried to recover the setup by rebooting both nodes, but the same behavior was observed after a few days.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Deploy StarlingX R4.0 in bare-metal duplex mode.
2. Leave it for 5-6 days.

Expected Behavior
------------------
The setup should be up and running.

Actual Behavior
----------------
The setup goes down every 5-6 days.
The OpenStack CLI becomes inaccessible.
OpenStack Horizon returns “503 Service Unavailable”.
The compute service fails on the active controller node.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Two node system

Last Pass
---------
NO

Timestamp/Logs
--------------
Logs attached.

Workaround
----------
Rebooted both controller nodes, but the same behavior recurred after a few days.

Tags: stx.storage
Amit (mahajanamit) wrote :

Hi All,

There is a critical failure in the kernel logs of controller-0. It is related to the XFS journaling filesystem, and it seems that crashes occurred in XFS handling during cinder-volume-usage-audit. The attached kernel logs show that an XFS-related error occurred around 2020-10-19T22:50:20 and the filesystem was shut down ("XFS (dm-5): Corruption of in-memory data detected. Shutting down filesystem"). After this time, containers/pods in Kubernetes started reporting errors. One such error from the Horizon logs is pasted below.

"2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 AH00036: access to / failed (filesystem path '/var')"

So could this issue be related to Ceph? Has anybody else faced such an issue?
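
For anyone trying to confirm the same correlation on their own system, here is a minimal sketch of the check described above: find the XFS shutdown message in the kernel log, then list container log lines stamped shortly afterwards. It assumes that every log line carries an ISO-8601 timestamp and that both logs use the same timezone; neither is confirmed by the attached logs. The function names, the 5-minute window, and the kernel log line layout in the sample are hypothetical, built only from the message and timestamps quoted in this comment.

```python
import re
from datetime import datetime, timedelta

# XFS shutdown message as quoted in the comment; captures the device (e.g. dm-5).
XFS_SHUTDOWN = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Corruption of in-memory data detected"
)
# First ISO-8601 timestamp on a line (assumed format, second precision).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def parse_ts(line):
    """Return the first timestamp found on the line, or None."""
    m = TIMESTAMP.search(line)
    return datetime.strptime(m.group(0), "%Y-%m-%dT%H:%M:%S") if m else None


def find_xfs_shutdowns(kernel_lines):
    """Return (timestamp, device) for each XFS corruption shutdown line."""
    events = []
    for line in kernel_lines:
        m = XFS_SHUTDOWN.search(line)
        if m:
            events.append((parse_ts(line), m.group("dev")))
    return events


def lines_after(events, pod_lines, window=timedelta(minutes=5)):
    """Return pod log lines stamped within `window` after any shutdown event."""
    hits = []
    for line in pod_lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        for ev_ts, _dev in events:
            if ev_ts is not None and ev_ts <= ts <= ev_ts + window:
                hits.append(line)
                break
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; only the message text and times come from the report.
    kernel = [
        "2020-10-19T22:50:20 controller-0 kernel: XFS (dm-5): Corruption of "
        "in-memory data detected. Shutting down filesystem",
    ]
    horizon = [
        "2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 "
        "AH00036: access to / failed (filesystem path '/var')",
    ]
    shutdowns = find_xfs_shutdowns(kernel)
    for line in lines_after(shutdowns, horizon):
        print(line)
```

If container errors consistently appear only after an XFS shutdown event, that would support the filesystem failure being the trigger rather than a symptom. It would not by itself show whether Ceph is involved.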

Amit (mahajanamit) wrote :

We are able to see the above-mentioned logs even after reinstalling the system.

Ghada Khalil (gkhalil)
tags: added: stx.storage
Ghada Khalil (gkhalil) wrote :

screening: marking as low priority due to lack of activity

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged