StarlingX

StarlingX setup goes down every week

Bug #1901131 reported by Akshay on 2020-10-23

This bug affects 1 person

Affects     Status    Importance  Assigned to  Milestone
StarlingX   Triaged   Low         Unassigned

Bug Description

Brief Description
-----------------
I have a StarlingX R4.0 bare-metal duplex deployment up and running.
Initially, VMs were running on the setup, and it went down (Horizon and the OpenStack CLI became inaccessible) after 5-6 days. We tried to recover the setup by rebooting both nodes, but the same behavior was observed again after a few days.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Deploy StarlingX R4.0 in bare-metal duplex mode.
2. Leave it for 5-6 days.

Expected Behavior
------------------
The setup should be up and running.

Actual Behavior
----------------
The setup goes down every 5-6 days.
The OpenStack CLI becomes inaccessible.
OpenStack Horizon returns “503 service unavailable”.
The compute service fails on the active controller node.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Two node system

Last Pass
---------
NO

Timestamp/Logs
--------------
Logs attached.

Workaround
----------
Rebooted both controller nodes, but the same behavior recurred after a few days.

Tags:

Amit (mahajanamit) wrote on 2020-11-02:

controller-0-kernel.log (8.7 KiB, text/plain)

Hi All,

The kernel logs of controller-0 show a critical failure. It is related to the XFS journaling filesystem: crashes appear to have occurred in XFS handling during cinder-volume-usage-audit. The attached kernel logs show that the XFS error occurred around 2020-10-19T22:50:20 and that the filesystem was shut down ("XFS (dm-5): Corruption of in-memory data detected. Shutting down filesystem"). After that time, containers/pods in Kubernetes started reporting errors. One such error from the Horizon logs is pasted below.

"2020-10-19T22:50:21.269063557Z stdout F 2020-10-19 22:50:21.263034 AH00036: access to / failed (filesystem path '/var')"

So could this issue be related to Ceph? Has anybody else faced such an issue?
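
To line up the filesystem shutdown with the pod errors, the kernel log can be scanned for the XFS shutdown message. The following is only a minimal sketch: it assumes each log line starts with an ISO 8601 timestamp, the function name find_xfs_shutdowns is hypothetical, and the sample lines are built from the message quoted above rather than copied from the attached log.

```python
import re
from datetime import datetime

# Assumption: each kernel log line starts with an ISO 8601 timestamp
# (e.g. 2020-10-19T22:50:20...). Adjust the pattern to the real log format.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s+(.*)$")

# Matches the shutdown message quoted in this comment and captures the device.
XFS_SHUTDOWN_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): .*Shutting down filesystem"
)


def find_xfs_shutdowns(lines):
    """Return a list of (timestamp, device) for each XFS shutdown message."""
    hits = []
    for line in lines:
        line_match = LINE_RE.match(line)
        if not line_match:
            continue  # skip lines without a leading timestamp
        shutdown = XFS_SHUTDOWN_RE.search(line_match.group(2))
        if shutdown:
            ts = datetime.strptime(line_match.group(1), "%Y-%m-%dT%H:%M:%S")
            hits.append((ts, shutdown.group("dev")))
    return hits


# Illustrative sample only: the hostname/prefix layout is an assumption.
sample = [
    "2020-10-19T22:50:20.000 controller-0 kernel: XFS (dm-5): "
    "Corruption of in-memory data detected. Shutting down filesystem",
    "2020-10-19T22:50:21.000 controller-0 kernel: unrelated message",
]

for ts, dev in find_xfs_shutdowns(sample):
    print(f"{ts.isoformat()} XFS shutdown on {dev}")
```

Comparing the earliest timestamp returned with the first pod or Horizon error would show whether the XFS shutdown came before the service failures.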

Amit (mahajanamit) wrote on 2020-11-20:

We see the above-mentioned logs even after reinstalling the system.

Ghada Khalil (gkhalil) on 2021-04-17

tags: added: stx.storage

Ghada Khalil (gkhalil) wrote on 2021-08-25:

screening: marking as low priority due to lack of activity

Changed in starlingx:
importance:	Undecided → Low
status:	New → Triaged
