Bug #1851585 “Replacing OSD hard disk on controller node fails” : Bugs : StarlingX

Frank Miller (sensfan22) wrote on 2019-11-15:

#1

Marking as gating for stx.4.0. Users should be able to replace OSD disks.

tags:	added: stx.4.0 stx.config stx.distro.other
Changed in starlingx:
assignee:	nobody → Ovidiu Poncea (ovidiu.poncea)
status:	New → Triaged

Ghada Khalil (gkhalil) on 2019-11-15

Changed in starlingx:
importance:	Undecided → Medium

Yang Liu (yliu12) on 2019-11-19

tags:	added: stx.retestneeded

Frank Miller (sensfan22) on 2019-11-20

Changed in starlingx:
assignee:	Ovidiu Poncea (ovidiu.poncea) → Paul-Ionut Vaduva (pvaduva)

Paul-Ionut Vaduva (pvaduva) wrote on 2019-11-29:

#2

On a VirtualBox deployment, I locked controller-1, shut it down, and replaced a virtual drive that was part of an OSD (/dev/sdc) with a new one. I then rebooted the VM and it booted just fine.
I need more details about the setup: logs, information about the replacement disk (was it wiped, new, etc.), information about the Ceph cluster setup, and, if possible, access to the setup where the bug was reproduced.

Paul-Ionut Vaduva (pvaduva) wrote on 2019-12-02:

#3

Hi Boris,

I am unable to reproduce the bug in a virtual environment. If no logs are available, can you please reproduce the bug, attach the logs, and give me access to the setup?
Thank you.

Wendy Mitchell (wmitchellwr) wrote on 2019-12-03:

#4

The lab in question appears to be configured as 2+2.
controller-1 has the following storage configuration:

apiVersion: starlingx.windriver.com/v1
kind: HostProfile
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: controller-1-profile
  namespace: deployment
spec:
  administrativeState: unlocked
  bootDevice: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
  clockSynchronization: ntp
  console: ttyS0,115200n8
  installOutput: graphical
  interfaces:
    ethernet:
    - class: platform
      dataNetworks: []
      name: oam0
      platformNetworks:
      - oam
      port:
        name: enp13s0f0
    - class: platform
      dataNetworks: []
      name: mgmt0
      platformNetworks:
      - mgmt
      - cluster-host
      port:
        name: enp13s0f1
  personality: controller
  powerOn: true
  provisioningMode: static
  rootDevice: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
  storage:
    filesystems:
    - name: scratch
      size: 8
    - name: backup
      size: 25
    - name: docker
      size: 30
    - name: kubelet
      size: 10
    osds:
    - cluster: ceph_cluster
      function: osd
      path: /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0
  subfunctions:
  - controller

$ system host-disk-list controller-1
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+-----------------------------------------------------------------+
| uuid                                 | device_no | device_ | device_ | size_ | available_ | rpm          | serial_ | device_path                                                     |
|                                      | de        | num     | type    | gib   | gib        |              | id      |                                                                 |
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+-----------------------------------------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda  | 2048    | HDD     | 838.  | 0.0        | Undetermined | S0N1RS0 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 |
|                                      |           |         |         | 362   |            |              | G0000B4 |                                                                 |
|                                      |           |         |         |       |            |              | 44BGA1  |                                                                 |
|                                      |           |         |         |       |            |              |         |                                                                 |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb  | 2064    | SSD     | 223.  | 0.0        | N/A          | CVTR528 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 |
|                                      |           |         |         | 57    |            |              | 101E824 |                                                                 |
|                                      |           |         |         |       |            |              | 0CGN    |                                                                 |

$ system host-stor-list controller-1
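+--------------------------------------+----------+-------+------------+--------------------------------------+-----------------------------------------------------------------------+--------------+------------------+-----------+
| uuid                                 | function | osdid | state      | idisk_uuid                           | journal_path                                                          | journal_node | journal_size_gib | tier_name |
+--------------------------------------+----------+-------+------------+--------------------------------------+-----------------------------------------------------------------------+--------------+------------------+-----------+
| c9027940-c8cd-4b9c-a73a-c7d367e6ed50 | osd      | 0     | configured | 91f0d50d-30a0-4087-a841-da223e492985 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0-part2 | /dev/sdb2    | 1                | storage   |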

$ ls -l /dev/disk/by-path/*
lrwxrwxrwx 1 root root  9 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 -> ../../sda
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Dec  3 15:56 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0-part2 -> ../../sdb2

controller-1:~$  udevadm info -q all -n /dev/sda
P: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:08.0/0000:04:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda
N: sda
S: disk/by-id/scsi-35000c50076366107
S: disk/by-id/wwn-0x5000c50076366107
S: disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
E: DEVLINKS=/dev/disk/by-id/scsi-35000c50076366107 /dev/disk/by-id/wwn-0x5000c50076366107 /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
E: DEVNAME=/dev/sda
E: DEVPATH=/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:08.0/0000:04:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda
E: DEVTYPE=disk
E: ID_BUS=scsi
E: ID_MODEL=ST900MM0026
E: ID_MODEL_ENC=ST900MM0026\x20\x20\x20\x20\x20
E: ID_PART_TABLE_TYPE=gpt
E: ID_PATH=pci-0000:04:00.0-sas-0x5000c50076366105-lun-0
E: ID_PATH_TAG=pci-0000_04_00_0-sas-0x5000c50076366105-lun-0
E: ID_REVISION=0003
E: ID_SAS_PATH=pci-0000:04:00.0-sas-phy0-lun-0
E: ID_SCSI=1
E: ID_SCSI_SERIAL=S0N1RS0G0000B444BGA1
E: ID_SERIAL=35000c50076366107
E: ID_SERIAL_SHORT=5000c50076366107
E: ID_TYPE=disk
E: ID_VENDOR=SEAGATE
E: ID_VENDOR_ENC=SEAGATE\x20
E: ID_WWN=0x5000c50076366107
E: ID_WWN_WITH_EXTENSION=0x5000c50076366107
E: MAJOR=8
E: MINOR=0
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=865273

controller-1:~$  udevadm info -q all -n /dev/sdb
P: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:08.0/0000:04:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sdb
N: sdb
S: disk/by-id/ata-INTEL_SSDSC2BW240H6_CVTR528101E8240CGN
S: disk/by-id/wwn-0x55cd2e414c8d201a
S: disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0
E: DEVLINKS=/dev/disk/by-id/ata-INTEL_SSDSC2BW240H6_CVTR528101E8240CGN /dev/disk/by-id/wwn-0x55cd2e414c8d201a /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0
E: DEVNAME=/dev/sdb
E: DEVPATH=/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:08.0/0000:04:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sdb
E: DEVTYPE=disk
E: ID_ATA=1
E: ID_ATA_DOWNLOAD_MICROCODE=1
E: ID_ATA_FEATURE_SET_APM=1
E: ID_ATA_FEATURE_SET_APM_CURRENT_VALUE=254
E: ID_ATA_FEATURE_SET_APM_ENABLED=1
E: ID_ATA_FEATURE_SET_HPA=1
E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
E: ID_ATA_FEATURE_SET_PM=1
E: ID_ATA_FEATURE_SET_PM_ENABLED=1
E: ID_ATA_FEATURE_SET_PUIS=1
E: ID_ATA_FEATURE_SET_PUIS_ENABLED=0
E: ID_ATA_FEATURE_SET_SECURITY=1
E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=2
E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=4
E: ID_ATA_FEATURE_SET_SMART=1
E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
E: ID_ATA_ROTATION_RATE_RPM=0
E: ID_ATA_SATA=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
E: ID_ATA_WRITE_CACHE=1
E: ID_ATA_WRITE_CACHE_ENABLED=1
E: ID_BUS=ata
E: ID_MODEL=INTEL_SSDSC2BW240H6
E: ID_MODEL_ENC=INTEL\x20SSDSC2BW240H6\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_PART_TABLE_TYPE=gpt
E: ID_PATH=pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0
E: ID_PATH_TAG=pci-0000_04_00_0-sas-0x5001e6738bc90001-lun-0
E: ID_REVISION=RG20
E: ID_SAS_PATH=pci-0000:04:00.0-sas-phy1-lun-0
E: ID_SERIAL=INTEL_SSDSC2BW240H6_CVTR528101E8240CGN
E: ID_SERIAL_SHORT=CVTR528101E8240CGN
E: ID_TYPE=disk
E: ID_WWN=0x55cd2e414c8d201a
E: ID_WWN_WITH_EXTENSION=0x55cd2e414c8d201a
E: MAJOR=8
E: MINOR=16
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=866651

Wendy Mitchell (wmitchellwr) wrote on 2019-12-03:

#5

controller-1 failed on the 2nd disk replacement operation.

[sysadmin@controller-0 log(keystone_admin)]$ system host-disk-list controller-1
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+----------------------------------+
| uuid                                 | device_no | device_ | device_ | size_ | available_ | rpm          | serial_ | device_path                      |
|                                      | de        | num     | type    | gib   | gib        |              | id      |                                  |
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+----------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda  | 2048    | HDD     | 838.  | 0.0        | Undetermined | S0N1RS0 | /dev/disk/by-path/pci-0000:04:00 |
|                                      |           |         |         | 362   |            |              | G0000B4 | .0-sas-0x5000c50076366105-lun-0  |
|                                      |           |         |         |       |            |              | 44BGA1  |                                  |
|                                      |           |         |         |       |            |              |         |                                  |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb  | 2064    | SSD     | 223.  | 0.0        | N/A          | CVTR528 | /dev/disk/by-path/pci-0000:04:00 |
|                                      |           |         |         | 57    |            |              | 101E824 | .0-sas-0x5001e6738bc90001-lun-0  |
|                                      |           |         |         |       |            |              | 0CGN    |                                  |
|                                      |           |         |         |       |            |              |         |                                  |

Wendy Mitchell (wmitchellwr) wrote on 2019-12-03:

#6

BUILD_ID="2019-11-06_10-52-51"
yow-ironpass-18543

1st time: the disk was replaced with this new one without issue; the manifest applied, the host is unlocked, enabled and available, and the health is OK.

/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0         SSD          167.68     0.0           N/A                 CVLT626405AL180BGN          INTEL SSDSC2KW18

2nd time: replacing the disk with the one that was originally removed failed.
puppet.log
2019-12-03T18:35:26.489 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: mount_activate: Failed to activate
2019-12-03T18:35:26.490 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: '['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', 'osd', 'new', u'c9027940-c8cd-4b9c-a73a-c7d367e6ed50']' failed with status code 17
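
For readers triaging the puppet.log lines above, here is a minimal sketch of how the failing command could be decoded. It assumes the `ceph` CLI exit status mirrors an errno value, which is an assumption and not something the log states; `decode_failure` is a hypothetical helper, not part of StarlingX or Ceph. Under that assumption, status 17 maps to EEXIST, which would be consistent with `osd new` being asked to register a UUID the cluster already knows. The UUID is the same one shown for the stor in the `system host-stor-list` output.

```python
import errno
import re

# Tail of the second puppet.log line quoted above, copied verbatim.
LOG_TAIL = (
    "returns: '['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', "
    "'--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', "
    "'osd', 'new', u'c9027940-c8cd-4b9c-a73a-c7d367e6ed50']' "
    "failed with status code 17"
)


# Hypothetical helper for illustration only.
def decode_failure(text):
    """Return (osd_uuid, status_code, errno_name) parsed from a log fragment."""
    status = int(re.search(r"failed with status code (\d+)", text).group(1))
    uuid = re.search(r"'osd', 'new', u?'([0-9a-f-]{36})'", text).group(1)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    # ASSUMPTION: the exit status of the ceph CLI is an errno value.
    return uuid, status, errno.errorcode.get(status, "UNKNOWN")


osd_uuid, status, name = decode_failure(LOG_TAIL)
print(osd_uuid, status, name)
```

If the assumption holds, the symbolic name gives a quick hint about why activation failed without having to read the Ceph source.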

$ system host-disk-list controller-1
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+----------------------------------+
| uuid                                 | device_no | device_ | device_ | size_ | available_ | rpm          | serial_ | device_path                      |
|                                      | de        | num     | type    | gib   | gib        |              | id      |                                  |
+--------------------------------------+-----------+---------+---------+-------+------------+--------------+---------+----------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda  | 2048    | HDD     | 838.  | 0.0        | Undetermined | S0N1RS0 | /dev/disk/by-path/pci-0000:04:00 |
|                                      |           |         |         | 362   |            |              | G0000B4 | .0-sas-0x5000c50076366105-lun-0  |
|                                      |           |         |         |       |            |              | 44BGA1  |                                  |
|                                      |           |         |         |       |            |              |         |                                  |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb  | 2064    | SSD     | 223.  | 0.0        | N/A          | CVTR528 | /dev/disk/by-path/pci-0000:04:00 |
|                                      |           |         |         | 57    |            |              | 101E824 | .0-sas-0x5001e6738bc90001-lun-0  |
|                                      |           |         |         |       |            |              | 0CGN    |                                  |
|                                      |           |         |         |       |            |              |         |                                  |

$ fm alarm-list
+-------+----------------------------------------------------------------------+---------------------------------------+----------+---------------+
| Alarm | Reason Text                                                          | Entity ID                             | Severity | Time Stamp    |
| ID    |                                                                      |                                       |          |               |
+-------+----------------------------------------------------------------------+---------------------------------------+----------+---------------+
| 200.  | controller-1 is degraded due to the failure of its 'ceph (osd.0, )'  | host=controller-1.process=ceph (osd.0 | major    | 2019-12-03T18 |
| 006   | process. Auto recovery of this major process is in progress.         | , )                                   |          | :38:32.749026 |
|       |                                                                      |                                       |          |               |
| 200.  | controller-1 experienced a service-affecting failure. Auto-recovery  | host=controller-1                     | critical | 2019-12-03T18 |
| 004   | in progress. Manual Lock and Unlock may be required if auto-recovery |                                       |          | :30:25.259416 |
|       | is unsuccessful.                                                     |                                       |          |               |
|       |                                                                      |                                       |          |               |
| 800.  | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or      | cluster=0c80fb8a-d97e-411b-b6ed-      | warning  | 2019-12-03T18 |
| 001   | undersized]. Please check 'ceph -s' for more details.                | 5192273a17c6                          |          | :11:36.558925 |
|       |                                                                      |                                       |          |               |
| 800.  | Loss of replication in replication group  group-0: OSDs are down     | cluster=0c80fb8a-d97e-411b-b6ed-      | major    | 2019-12-03T18 |
| 011   |                                                                      | 5192273a17c6.peergroup=group-0.host=  |          | :10:35.973510 |
|       |                                                                      | controller-1                          |          |               |
|       |                                                                      |                                       |          |               |
| 400.  | Service group directory-services loss of redundancy; expected 2      | service_domain=controller.            | major    | 2019-12-03T18 |
| 002   | active members but only 1 active member available                    | service_group=directory-services      |          | :10:15.111380 |
|       |                                                                      |                                       |          |               |
| 400.  | Service group web-services loss of redundancy; expected 2 active     | service_domain=controller.            | major    | 2019-12-03T18 |
| 002   | members but only 1 active member available                           | service_group=web-services            |          | :10:14.899447 |
|       |                                                                      |                                       |          |               |
| 400.  | Service group storage-services loss of redundancy; expected 2 active | service_domain=controller.            | major    | 2019-12-03T18 |
| 002   | members but only 1 active member available                           | service_group=storage-services        |          | :10:14.444396 |

Wendy Mitchell (wmitchellwr) wrote on 2019-12-03:

#7

controller-1_20191203.184742.tgz (90.7 MiB, application/x-tar)

Wendy Mitchell (wmitchellwr) wrote on 2019-12-03:

#8

controller-0_20191203.184742.tgz (129.3 MiB, application/x-tar)

Ovidiu Poncea (ovidiuponcea) wrote on 2019-12-04:

#9

The failure observed the second time is expected behavior: disks used as replacements must be wiped before reuse. The system avoids wiping data that appears to be valid (e.g. when reusing a disk that came from an OSD).
This information should already be in the disk replacement procedure docs; if it is not, it should be added there.

Paul-Ionut Vaduva (pvaduva) wrote on 2019-12-05:

#10

I took a closer look with Ovidiu, and apparently there is a corner case where the disk wipe does not successfully wipe all the necessary partition data, namely when there are multiple existing partitions on the disk. In such cases problems can indeed arise, such as the ones Wendy hit when putting the original disk back (which of course still contained the original partitions). We are going to accept this as a bug and treat it accordingly.
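
To illustrate the corner case, here is a minimal dry-run sketch of a wipe that covers every existing partition before the disk itself. It is not the code from the StarlingX fix: `build_wipe_plan` is a hypothetical helper, and the choice of `wipefs --all` followed by `sgdisk --zap-all` is an assumption about one reasonable way to clear stale signatures. The partition names follow the `-part1`/`-part2` links shown in the `/dev/disk/by-path` listing. The sketch only builds and prints the commands; it does not run them.

```python
# Hypothetical helper for illustration; the commands are standard
# util-linux / gdisk tools, but this is not the StarlingX implementation.
def build_wipe_plan(disk, partitions):
    """Return the commands that would clear every partition, then the disk."""
    plan = []
    # Clear signatures on each partition first, so stale OSD data or journal
    # headers cannot be rediscovered once the partition table is recreated.
    for part in partitions:
        plan.append(["wipefs", "--all", part])
    # Then clear whole-disk signatures and the GPT/MBR structures.
    plan.append(["wipefs", "--all", disk])
    plan.append(["sgdisk", "--zap-all", disk])
    return plan


DISK = "/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0"
plan = build_wipe_plan(DISK, [DISK + "-part1", DISK + "-part2"])
for cmd in plan:
    # Dry run: print only. Executing these commands destroys data.
    print(" ".join(cmd))
```

Wiping per partition matters because a new partition table can place partitions at the same offsets as the old one, which would expose the old contents again.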

OpenStack Infra (hudson-openstack) wrote on 2019-12-06: Fix proposed to config (master)

#11

Fix proposed to branch: master
Review: https://review.opendev.org/697679

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-12-12: Fix merged to config (master)

#12

Reviewed: https://review.opendev.org/697679
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=dfeb00e125808049d636c039782b5d0b748bba1b
Submitter: Zuul
Branch: master

commit dfeb00e125808049d636c039782b5d0b748bba1b
Author: Paul Vaduva <email address hidden>
Date: Tue Dec 10 06:46:35 2019 -0500

    Wiping the disk preexistent partitions

    When there are pre-existent partitions on newly installed
    disk Ceph recognizes them and throws an error

    Change-Id: I6641fc4ae0f5252aa500052e61ee79db1cbdacdd
    Closes-Bug: 1851585
    Signed-off-by: Paul Vaduva <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2021-10-27

tags:	removed: stx.retestneeded
