2nd time, the disk was replaced with the one that was originally removed, and the activation failed.
puppet.log
2019-12-03T18:35:26.489 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: mount_activate: Failed to activate
2019-12-03T18:35:26.490 Notice: 2019-12-03 18:35:26 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-1]/Ceph::Osd[/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0]/returns: '['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', 'osd', 'new', u'c9027940-c8cd-4b9c-a73a-c7d367e6ed50']' failed with status code 17
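
Status code 17 maps to errno EEXIST, which points at the OSD uuid on the re-inserted disk already being registered in the cluster. A minimal diagnostic sketch, assuming the uuid in the log above is the on-disk OSD fsid and that osd.0 is the affected OSD:

$ ceph osd dump | grep -i c9027940-c8cd-4b9c-a73a-c7d367e6ed50   # is this uuid already claimed by an existing OSD id?
$ ceph-disk list                                                 # data partitions as prepared/activated by ceph-disk
$ cat /var/lib/ceph/osd/ceph-0/fsid                              # fsid recorded on the osd.0 data directory, if mounted
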
$ fm alarm-list
+-------+----------------------------------------------------------------------+---------------------------------------+----------+---------------+
| Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| ID | | | | |
+-------+----------------------------------------------------------------------+---------------------------------------+----------+---------------+
| 200. | controller-1 is degraded due to the failure of its 'ceph (osd.0, )' | host=controller-1.process=ceph (osd.0 | major | 2019-12-03T18 |
| 006 | process. Auto recovery of this major process is in progress. | , ) | | :38:32.749026 |
| | | | | |
| 200. | controller-1 experienced a service-affecting failure. Auto-recovery | host=controller-1 | critical | 2019-12-03T18 |
| 004 | in progress. Manual Lock and Unlock may be required if auto-recovery | | | :30:25.259416 |
| | is unsuccessful. | | | |
| | | | | |
| 800. | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or | cluster=0c80fb8a-d97e-411b-b6ed- | warning | 2019-12-03T18 |
| 001 | undersized]. Please check 'ceph -s' for more details. | 5192273a17c6 | | :11:36.558925 |
| | | | | |
| 800. | Loss of replication in replication group group-0: OSDs are down | cluster=0c80fb8a-d97e-411b-b6ed- | major | 2019-12-03T18 |
| 011 | | 5192273a17c6.peergroup=group-0.host= | | :10:35.973510 |
| | | controller-1 | | |
| | | | | |
| 400. | Service group directory-services loss of redundancy; expected 2 | service_domain=controller. | major | 2019-12-03T18 |
| 002 | active members but only 1 active member available | service_group=directory-services | | :10:15.111380 |
| | | | | |
| 400. | Service group web-services loss of redundancy; expected 2 active | service_domain=controller. | major | 2019-12-03T18 |
| 002 | members but only 1 active member available | service_group=web-services | | :10:14.899447 |
| | | | | |
| 400. | Service group storage-services loss of redundancy; expected 2 active | service_domain=controller. | major | 2019-12-03T18 |
| 002   | members but only 1 active member available                           | service_group=storage-services        |          | :10:14.444396 |
+-------+----------------------------------------------------------------------+---------------------------------------+----------+---------------+
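
As alarm 800.001 suggests, the cluster state can be inspected directly; for example:

$ ceph -s          # overall health plus degraded/undersized PG details
$ ceph osd tree    # up/down status per OSD, e.g. osd.0 down (per alarm 800.011)
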
BUILD_ID="2019-11-06_10-52-51"
yow-ironpass-18543
1st time, the disk was replaced with a new one without issue: the manifest applied, and the host is unlocked, enabled, and available with health OK. The replacement SSD:
/dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0  SSD  167.68  0.0  N/A  CVLT626405AL180BGN  INTEL SSDSC2KW18
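
The unlocked/enabled/available state can be confirmed with the standard CLI; a minimal check, assuming the usual sysinv field names:

$ system host-show controller-1 | grep -E 'administrative|operational|availability'
$ fm alarm-list    # no storage alarms expected after the 1st replacement
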
2nd time, the disk was replaced with the one that was originally removed and activation failed (see the puppet.log excerpt above). The re-inserted disk shows as /dev/sdb:
$ system host-disk-list controller-1
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | rpm          | serial_id            | device_path                                                     |
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
| 33d64f76-b367-4d9c-a7a9-4aca0ccd61ec | /dev/sda    | 2048       | HDD         | 838.362  | 0.0           | Undetermined | S0N1RS0G0000B444BGA1 | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5000c50076366105-lun-0 |
| 91f0d50d-30a0-4087-a841-da223e492985 | /dev/sdb    | 2064       | SSD         | 223.57   | 0.0           | N/A          | CVTR528101E8240CGN   | /dev/disk/by-path/pci-0000:04:00.0-sas-0x5001e6738bc90001-lun-0 |
+--------------------------------------+-------------+------------+-------------+----------+---------------+--------------+----------------------+-----------------------------------------------------------------+
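
A quick sanity check that the re-inserted disk enumerates at the same device_path the OSD was provisioned on:

$ ls -l /dev/disk/by-path/ | grep 0x5001e6738bc90001   # should resolve to /dev/sdb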