StarlingX setup goes down every week
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Triaged | Low | Unassigned | | |
Bug Description
Brief Description
-----------------
I have a StarlingX R4.0 Baremetal duplex mode deployment up and running.
Initially, VMs were running on the setup. After 5-6 days the setup went down (Horizon and the OpenStack CLI became inaccessible). We tried to recover the setup by rebooting both nodes, but the same behavior was observed again after a few days.
Severity
--------
Critical
Steps to Reproduce
------------------
1. Deploy StarlingX R4.0 Baremetal duplex mode.
2. Leave it for 5-6 days.
Expected Behavior
------------------
The setup should be up and running.
Actual Behavior
----------------
Setup goes down every 5-6 days.
OpenStack CLI becomes inaccessible.
“503 service unavailable” on OpenStack Horizon.
Compute service fails on active controller node.
Reproducibility
---------------
100% reproducible
System Configuration
-------
Two node system
Last Pass
---------
NO
Timestamp/Logs
--------------
Logs attached.
Workaround
----------
Rebooted both controller nodes, but the same behavior was encountered again after a few days.
tags: added: stx.storage
Hi All,
A critical failure appears in the kernel logs of controller-0. It is related to the XFS journaling filesystem: crashes seem to have occurred in XFS handling during cinder-volume-usage-audit. The attached kernel logs show that the XFS error occurred around 2020-10-19T22:50:20 and the filesystem was shut down ("XFS (dm-5): Corruption of in-memory data detected. Shutting down filesystem"). After this time, containers/pods in Kubernetes started reporting errors. One such error from the Horizon logs is pasted below.
"2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 AH00036: access to / failed (filesystem path '/var')"
So could this issue be related to Ceph? Has anybody else faced such an issue?
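
For anyone trying to check whether the same XFS shutdown appears in their own kernel logs, here is a rough sketch of how the timestamp and device could be pulled out so they can be compared against pod error times. It is only an illustration: the line layout (ISO timestamp at the start of each line), the function name `find_xfs_shutdowns`, and the two sample lines are assumptions built around the message quoted above, not lines copied from the attached logs.

```python
import re
from datetime import datetime

# Assumed line layout (hypothetical): "<ISO timestamp> <host> kernel: XFS (<dev>): <message>"
# Only the message text itself comes from the report above.
XFS_SHUTDOWN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s+.*?"
    r"XFS \((?P<dev>[^)]+)\): (?P<msg>Corruption of in-memory data detected.*)$"
)


def find_xfs_shutdowns(lines):
    """Return (timestamp, device, message) for each XFS corruption line found."""
    events = []
    for line in lines:
        match = XFS_SHUTDOWN.search(line.rstrip("\n"))
        if match:
            when = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S")
            events.append((when, match["dev"], match["msg"]))
    return events


# Illustrative sample lines (hypothetical, not taken from the attached logs).
sample = [
    "2020-10-19T22:50:19.000 controller-0 kernel: some unrelated line",
    "2020-10-19T22:50:20.123 controller-0 kernel: XFS (dm-5): "
    "Corruption of in-memory data detected. Shutting down filesystem",
]

for when, dev, msg in find_xfs_shutdowns(sample):
    print(when.isoformat(), dev, msg)
```

To use it on a real file, the lines of the kernel log can be passed in place of `sample`; the regular expression will likely need adjusting if the log prefix differs from the assumed layout.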