StarlingX setup goes down every week

Bug #1901131 reported by Akshay on 2020-10-23
This bug affects 1 person
Affects: StarlingX
Importance: Undecided
Assigned to: Unassigned

Bug Description

Brief Description
-----------------
I have a StarlingX R4.0 Baremetal duplex mode deployment up and running.
VMs were initially running on the setup. After 5-6 days the setup went down (Horizon and the OpenStack CLI became inaccessible). We tried to recover it by rebooting both nodes, but the same behavior recurred after a few days.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Deploy StarlingX R4.0 Baremetal duplex mode.
2. Leave it for 5-6 days.

Expected Behavior
------------------
The setup should be up and running.

Actual Behavior
----------------
The setup goes down every 5-6 days.
The OpenStack CLI becomes inaccessible.
OpenStack Horizon returns “503 service unavailable”.
The compute service fails on the active controller node.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Two-node system

Last Pass
---------
NO

Timestamp/Logs
--------------
Logs attached.

Workaround
----------
Rebooted both controller nodes, but the same behavior recurred after a few days.

Amit (mahajanamit) wrote :

Hi All,

The kernel logs of controller-0 show a critical failure. It is related to the XFS journaling filesystem, and the crashes in XFS handling appear to have occurred during cinder-volume-usage-audit. The attached kernel logs show that the XFS error occurred around 2020-10-19T22:50:20 and that the filesystem was shut down ("XFS (dm-5): Corruption of in-memory data detected. Shutting down filesystem"). After that time, containers/pods in Kubernetes started reporting errors. One such error from the Horizon logs is pasted below.

"2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 AH00036: access to / failed (filesystem path '/var')"

So could this issue be related to Ceph? Has anybody else faced such an issue?
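
For anyone who wants to check their own logs for the same pattern, here is a rough sketch of the correlation described above: find the XFS shutdown message in the kernel log, then list the log lines stamped at or after that moment. It assumes every log line starts with an ISO-8601 timestamp such as 2020-10-19T22:50:20, which may not match every log file on the system. The function names are made up for illustration and are not part of any StarlingX tooling.

```python
import re
from datetime import datetime

# Matches the XFS shutdown message quoted in this report and captures the device name.
XFS_SHUTDOWN = re.compile(
    r"XFS \((?P<dev>[^)]+)\): Corruption of in-memory data detected"
)
# Assumption: each log line begins with an ISO-8601 timestamp (second precision is enough).
TIMESTAMP = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
TS_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    match = TIMESTAMP.match(line)
    return datetime.strptime(match.group("ts"), TS_FORMAT) if match else None


def find_xfs_shutdowns(lines):
    """Return a list of (timestamp, device) for each XFS corruption shutdown line."""
    events = []
    for line in lines:
        hit = XFS_SHUTDOWN.search(line)
        if hit:
            events.append((parse_timestamp(line), hit.group("dev")))
    return events


def lines_at_or_after(lines, start):
    """Return the timestamped lines whose timestamp is at or after `start`."""
    selected = []
    for line in lines:
        when = parse_timestamp(line)
        if when is not None and when >= start:
            selected.append(line)
    return selected
```

Calling `find_xfs_shutdowns` on the lines of the kernel log and then `lines_at_or_after` on the lines of a pod log, with the first event's timestamp, gives the errors that followed the filesystem shutdown. This only shows ordering in time, not cause.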

Amit (mahajanamit) wrote :

We see the above-mentioned logs even after reinstalling the system.
