Storage node comes up as failed on disk swap

Bug #1797626 reported by Maria Yousaf
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Low
Assigned to: Tingjie Chen
Milestone: (not set)

Bug Description

Brief Description
-----------------
Storage node comes up as failed on disk swap

Severity
--------
Major

Steps to Reproduce
------------------
1. storage-0 had one OSD provisioned. That OSD was using a collocated journal of 1024 MiB.
2. storage-0 was locked and a journal disk was provisioned. The OSD was updated to use a 2048 MiB journal on the dedicated journal disk instead of a collocated journal.
3. storage-0 was unlocked. This was repeated for all storage nodes within the same peer group.
4. storage-0 was then locked and powered off. The OSD and journal disks were swapped.
5. The node was then powered on and, once it came online, unlocked.
6. The storage node eventually went into the Failed state. I'm seeing the following reported in puppet.log:

n[check]/Exec[manage-partitions-check]/returns: sysinv 2018-10-12 17:56:57.855 23584 CRITICAL sysinv [-] Partition /dev/disk/by-path/pci-0000:84:00.0-nvme-1-part2 not present in OS

0000:84:00.0 is the OSD
0000:05:00.0 is the journal
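
To make the log line easier to triage, here is a minimal Python sketch that splits the by-path name from the message into its PCI address and partition number, and checks whether the path exists on the node. The function names and the regular expression are illustrative assumptions, not part of sysinv, and the pattern only covers the by-path layout seen in this log.

```python
import re
from pathlib import Path

# Matches by-path partition names shaped like the one in the log line,
# e.g. pci-0000:84:00.0-nvme-1-part2. Other device types may differ.
BY_PATH_RE = re.compile(
    r"pci-(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r"-(?P<bus>.+?)-part(?P<part>\d+)$"
)


def parse_by_path(device_path):
    """Return (pci_address, bus_suffix, partition_number) for a by-path name."""
    match = BY_PATH_RE.search(device_path)
    if match is None:
        raise ValueError(f"unrecognized by-path name: {device_path}")
    return match.group("pci"), match.group("bus"), int(match.group("part"))


def partition_present(device_path):
    """Return True if the partition path currently exists in the OS."""
    return Path(device_path).exists()


if __name__ == "__main__":
    path = "/dev/disk/by-path/pci-0000:84:00.0-nvme-1-part2"
    print(parse_by_path(path))
    print(partition_present(path))
```

Run on the storage node, the second call shows whether the partition named in the CRITICAL message is visible to the OS at that moment.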

Speaking to a designer, it may be the case that partitions should be re-created by the system according to what is in the database.
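
As a rough illustration of that idea, the sketch below compares the partitions a database expects against those present in the OS and reports the ones that would need re-creating. It is not how sysinv works internally: the function name is hypothetical, and the part1 entry is made-up sample data (only part2 appears in the log).

```python
def missing_partitions(expected, present):
    """Return expected partition paths that are absent from the OS, sorted."""
    return sorted(set(expected) - set(present))


if __name__ == "__main__":
    # Hypothetical database contents; only the part2 path comes from the log.
    expected = [
        "/dev/disk/by-path/pci-0000:84:00.0-nvme-1-part1",
        "/dev/disk/by-path/pci-0000:84:00.0-nvme-1-part2",
    ]
    # Hypothetical view of what the OS reports after the disk swap.
    present = [
        "/dev/disk/by-path/pci-0000:84:00.0-nvme-1-part1",
    ]
    for path in missing_partitions(expected, present):
        print("would need re-creating:", path)
```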

Workaround is to lock the host, do a host reinstall and unlock
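
The workaround sequence can be sketched as the ordered list of commands below. This assumes the StarlingX `system` CLI subcommands `host-lock`, `host-reinstall` and `host-unlock`; the helper name is hypothetical, and the sketch only prints the commands rather than running them. In practice the unlock would wait until the reinstall has finished.

```python
import shlex


def workaround_commands(hostname):
    """Return the lock, reinstall, unlock commands for a host, in order."""
    return [
        ["system", "host-lock", hostname],
        ["system", "host-reinstall", hostname],  # assumed subcommand name
        ["system", "host-unlock", hostname],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in workaround_commands("storage-0"):
        print(shlex.join(cmd))
```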

Expected Behavior
------------------
Storage node recovers

Actual Behavior
----------------
Storage node fails

Reproducibility
---------------
Tried once so far

System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
stx.10.2018 as of 2018-10-12_01-52-00

Timestamp/Logs
--------------
2018-10-12 17:56:57.855

Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 as this is a corner case during disk replacement

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.config
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Bruce Jones (brucej)
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Zhuweiwei (vivian.zhu)
Changed in starlingx:
assignee: Zhuweiwei (vivian.zhu) → Changcheng Intel (liuc-intel)
Changcheng Intel (liuc-intel) wrote :

1. How is the test environment set up? Please provide detailed documentation of the hardware requirements and the software being run.

2. What are the detailed steps to reproduce this problem?

3. From the given log "Partition /dev/disk/by-path/pci-0000:84:00.0-nvme-1-part2":
   Does device "pci-0000:84:00.0-nvme-1" still exist?

4. After swapping, the OSD should be bound to device "0000:05:00.0". Could Maria Yousaf check why the OSD is still trying to bind to device "pci-0000:84:00.0-nvme-1-part2"?

Changcheng Intel (liuc-intel) wrote :

Maria Yousaf, please respond to the questions above.

Changcheng Intel (liuc-intel) wrote :

Hi Maria Yousaf,
   Could you give us your feedback?
   This problem was reported on 2018-10-13. What is the latest status?

Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Changed in starlingx:
assignee: Changcheng Intel (liuc-intel) → Tingjie Chen (silverhandy)
Bruce Jones (brucej) wrote :

Maria please retest this bug, verify that it is still an issue and if so, provide detailed repro steps. Thanks!

Changed in starlingx:
status: Triaged → Incomplete
tags: added: stx.storage
Cindy Xie (xxie1)
tags: added: stx.distro.other
Cindy Xie (xxie1)
tags: removed: stx.distro.other
Cindy Xie (xxie1) wrote :

Ovidiu comments that this is expected behavior.

tags: removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

Changing the priority to Low, as this is no longer considered gating based on Cindy's comment above.
If this is expected behavior, I suggest updating the status to Invalid so that the bug is considered closed. Is there a reason to keep this open?

Changed in starlingx:
importance: Medium → Low
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → Invalid
Numan Waheed (nwaheed)
tags: removed: stx.retestneeded