Replacing OSD hard disk on controller node fails

Bug #1851585 reported by Boris Shteinbock on 2019-11-06
This bug affects 1 person
Affects: StarlingX
Importance: Medium
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
Testing of the HDD replacement feature failed. After a simulated HDD failure and subsequent replacement of the HDD used for the OSD, the controller node failed to unlock.

Severity
--------
Major

Steps to Reproduce
------------------
1. Controller-1 was locked and shut down.
2. The HDD was removed and replaced with a new one.
3. The node was booted.

Expected Behavior
------------------
After the reboot, sysinv is expected to update the node inventory and use the replacement disk for the OSD.

Actual Behavior
----------------
When the node booted with the new disk, udevd began to segfault every minute and the system inventory was broken, preventing a correct update. Unlocking the node was also unsuccessful, and the node entered a reboot loop.

First segfault message:
2019-10-30T11:42:21.948 controller-1 kernel: info [ 8.385476] systemd-udevd[294]: segfault at 0 ip 00007f11c3a89e14 sp 00007ffea9ecaac8 error 4 in libc-2.17.so[7f11c3920000+1c2000]
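
As a triage aid, the instruction pointer in the message above can be converted into an offset inside libc-2.17.so, which tools such as addr2line or gdb can resolve when matching debug symbols are installed. The sketch below is not part of the original report. The regular expression and the function name `parse_segfault` are illustrative, and the code assumes the usual x86 kernel segfault line format.

```python
import re

# Pattern for the kernel's "segfault at ... in lib[base+size]" message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the faulting library and the offset of the instruction in it."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return {
        "fault_address": int(match.group("addr"), 16),
        "library": match.group("lib"),
        "offset": ip - base,  # instruction pointer relative to the mapping base
        "error_code": int(match.group("err")),
    }


line = (
    "systemd-udevd[294]: segfault at 0 ip 00007f11c3a89e14 sp 00007ffea9ecaac8 "
    "error 4 in libc-2.17.so[7f11c3920000+1c2000]"
)
info = parse_segfault(line)
print(info["library"], hex(info["offset"]), info["fault_address"])
```

A fault address of 0 together with error code 4 (a user-mode read of an unmapped page) suggests a null pointer being read inside a libc routine, although the report itself does not identify the caller.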

I was unable to pull core dumps because none were present in /var/crash. It seems that systemd segfault handling is complicated on CentOS-based systems.

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-10-17_20-00-00"

Last Pass
---------
No

Timestamp/Logs
--------------
First boot after disk replacement occurred around 2019-10-30T11:42:21.917
First segfault observed at 2019-10-30T11:42:21

Test Activity
-------------
Feature Testing

Frank Miller (sensfan22) wrote :

Marking as gating for stx.4.0. Users should be able to replace OSD disks.

tags: added: stx.4.0 stx.config stx.distro.other
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
status: New → Triaged
Ghada Khalil (gkhalil) on 2019-11-15
Changed in starlingx:
importance: Undecided → Medium
Yang Liu (yliu12) on 2019-11-19
tags: added: stx.retestneeded
Frank Miller (sensfan22) on 2019-11-20
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Paul-Ionut Vaduva (pvaduva)
Paul-Ionut Vaduva (pvaduva) wrote :

On a VirtualBox deployment, I locked controller-1, shut it down, and replaced a virtual drive that was part of an OSD (/dev/sdc) with a new one. I then rebooted the VM, and it booted just fine.
I need more details: the setup logs, information about the replacement disk (was it wiped, new, etc.), information about the Ceph cluster setup, and, if possible, access to the setup where the problem was reproduced.

Paul-Ionut Vaduva (pvaduva) wrote :

Hi Boris,

I am unable to reproduce the bug in a virtual environment. If no logs are available, can you please reproduce the bug, attach the logs, and give me access to the setup?
Thank you

Wendy Mitchell (wmitchellwr) wrote :

The lab in question appears to be configured as 2+2.
controller-1 has the following storage configuration:

apiVersion: starlingx.windriver.com/v1
kind: HostProfile
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: controller-1-profile
  namespace: deployment
spec:
  administrativeState: unlocked
  bootDevice: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
  clockSynchronization: ntp
  console: ttyS0,115200n8
  installOutput: graphical
  interfaces:
    ethernet:
    - class: platform
      dataNetworks: []
      name: oam0
      platformNetworks:
      - oam
      port:
        name: enp13s0f0
    - class: platform
      dataNetworks: []
      name: mgmt0
      platformNetworks:
      - mgmt
      - cluster-host
      port:
        name: enp13s0f1
  personality: controller
  powerOn: true
  provisioningMode: static
  rootDevice: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
  storage:
    filesystems:
    - name: scratch
      size: 8
    - name: backup
      size: 25
    - name: docker
      size: 30
    - name: kubelet
      size: 10
    osds:
    - cluster: ceph_cluster
      function: osd
      path: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0
  subfunctions:
  - controller

$ system host-disk-list controller-1
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+-----------------------------------------------------------------+
| uuid | device_no | device_ | device_ | size_ | available_ | rpm | serial_ | device_path |
| | de | num | type | gib | gib | | id | |
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+-----------------------------------------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda | 2048 | HDD | 838. | 0.0 | Undetermined | S0N1RS0 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 |
| | | | | 362 | | | G0000B4 | |
| | | | | | | | 44BGA1 | |
| | | | | | | | | |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb | 2064 | SSD | 223. | 0.0 | N/A | CVTR528 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 |
| | | | | 57 | | | 101E824 | |
|
              | ...


Wendy Mitchell (wmitchellwr) wrote :

controller-1 failed on the 2nd disk replacement operation

[sysadmin@controller-0 log(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 3  | worker-0     | worker      | unlocked       | enabled     | available    |
| 4  | worker-1     | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
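
For scripted checks, a listing like the one above can be parsed to pick out hosts that are not available. This is only a sketch: `parse_table` is a hypothetical helper written for this note, not part of the `system` CLI, and it assumes each record fits on one row.

```python
def parse_table(text):
    """Parse a '|'-delimited ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip border lines, prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


listing = """
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | disabled | failed |
| 3 | worker-0 | worker | unlocked | enabled | available |
| 4 | worker-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
"""

hosts = parse_table(listing)
unhealthy = [(h["hostname"], h["availability"]) for h in hosts
             if h["availability"] != "available"]
print(unhealthy)
```

The helper does not merge cells that the CLI wraps across several rows, so wide listings such as the disk list would need extra handling.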

[sysadmin@controller-0 log(keystone_admin)]$ system host-disk-list controller-1
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | rpm          | serial_id            | device_path                                                     |
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda    | 2048       | HDD         | 838.362  | 0.0           | Undetermined | S0N1RS0G0000B444BGA1 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb    | 2064       | SSD         | 223.57   | 0.0           | N/A          | CVTR528101E8240CGN   | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 |

Wendy Mitchell (wmitchellwr) wrote :

BUILD_ID="2019-11-06_10-52-51"
yow-ironpass-18543

1st time: the disk was replaced with this new one without issue. The manifest applied; the host is unlocked, enabled, and available; and the health is OK.

/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 SSD 167.68 0.0 N/A CVLT626405AL180BGN INTEL SSDSC2KW18

2nd time: replacing the disk with the one that was originally removed failed.
puppet.log
2019-12-03T18:35:26.489 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: mount_activate: Failed to activate
2019-12-03T18:35:26.490 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: '['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', 'osd', 'new', u'c9027940-c8cd-4b9c-a73a-c7d367e6ed50']' failed with status code 17
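
One possible reading of "status code 17", assuming the ceph CLI reports an errno value as its exit status (an assumption, not something confirmed in this report), is EEXIST. The short sketch below just looks up the symbolic name; `describe_status` is a name made up for this note.

```python
import errno
import os


def describe_status(code):
    """Map a numeric status to its errno name and message, if one exists."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 17 is the status reported by the failing "ceph ... osd new" call above.
name, message = describe_status(17)
print(name, "-", message)
```

If that reading is right, the cluster may have rejected the `osd new` request because the UUID stored on the returned disk was already associated with an OSD.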

$ system host-disk-list controller-1
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | rpm          | serial_id            | device_path                                                     |
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda    | 2048       | HDD         | 838.362  | 0.0           | Undetermined | S0N1RS0G0000B444BGA1 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb    | 2064       | SSD         | 223.57   | 0.0           | N/A          | CVTR528101E8240CGN   | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 |

$ fm alarm-list
+-------+-----------------...


Ovidiu Poncea (ovidiu.poncea) wrote :

The failure observed the second time is expected behavior: disks used as replacements must be wiped before reuse. The system avoids wiping data that seems to be valid (i.e. data on a disk that was previously used by an OSD).
This information should already be in the disk replacement procedure docs; if it is not, it should be added there.

Paul-Ionut Vaduva (pvaduva) wrote :

I took a closer look with Ovidiu, and apparently there is a corner case where the disk wipe does not successfully wipe all the necessary data for partitions, namely when there are multiple existing partitions on the disk. In such cases problems can indeed arise, such as the ones Wendy hit when putting the original disk back (which of course still contained the original partitions). We are going to accept this as a bug and treat it accordingly.
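
To illustrate the corner case: a wipe that only clears signatures on the whole-disk device can leave signatures inside each existing partition. The dry-run sketch below lists one way to cover every partition before the disk itself. It is not the fix proposed for this bug. The function name `build_wipe_commands` and the example device names are hypothetical, and the code only builds command lines without running them.

```python
def build_wipe_commands(disk, partitions):
    """Return wipe commands for every partition first, then the disk itself."""
    commands = []
    for part in partitions:
        # Clear filesystem and other signatures inside each partition.
        commands.append(["wipefs", "--all", part])
    # Clear signatures on the whole-disk device.
    commands.append(["wipefs", "--all", disk])
    # Destroy the GPT and MBR partition tables.
    commands.append(["sgdisk", "--zap-all", disk])
    return commands


# Hypothetical example: a returned OSD disk that still has two partitions.
commands = build_wipe_commands("/dev/sdb", ["/dev/sdb1", "/dev/sdb2"])
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the command construction separate from execution lets the list be reviewed before anything destructive is run against a real device.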

Fix proposed to branch: master
Review: https://review.opendev.org/697679

Changed in starlingx:
status: Triaged → In Progress