StarlingX setup goes down every week

Bug #1901131 reported by Akshay on 2020-10-23
This bug affects 1 person
Affects Status Importance Assigned to Milestone

Bug Description

Brief Description
I have a StarlingX R4.0 Baremetal duplex mode deployment up and running.
Initially there were VMs running on the setup and setup went down (Horizon and OpenStack CLI become inaccessible) after 5-6 days. We tried to recover the setup after rebooting both the nodes, and same behavior was observed after few days.


Steps to Reproduce
1. Deploy StarlingX R4.0 Baremetal duplex mode.
2. Leave it for 5-6 days.

Expected Behavior
The setup should be up and running.

Actual Behavior
Setup goes down every 5-6 days.
Openstack CLI becomes inaccessible.
“503 service unavailable” on Openstack Horizon.
Compute service fails on active controller node.

100% reproducible

System Configuration
Two node system

Last Pass

Logs attached.

Rebooted both controller nodes. But, same behavior encountered after few days.

Amit (mahajanamit) wrote :

Hi All,

A critical failure is there in kernel logs of the controller-0. This is related to XFS journaling filesystem and seems during cinder-volume-usage-audit, crashes have occurred in XFS handling. Attached kernel logs show that XFS related error occurred around 2020-10-19T22:50:20 and filesystem was shut ("XFS (dm-5): Corruption of in-memory data detected. Shutting down filesystem"). After this particular time, containers/pods in kubernetes started giving errors. One such error from Horizon logs is pasted below.

"2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 AH00036: access to / failed (filesystem path '/var')"

So could this issue be related to Ceph. Has anybody else faced such issue?

Amit (mahajanamit) wrote :

We are able to see above mentioned logs even after reinstalling the system.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers