Failed to reinstall controller on AIO-DX system

Bug #1860165 reported by David Sullivan
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
Controller-0 enters a failed state after performing a host-reinstall on an AIO-DX system. The same issue likely affects controller-1. It appears to be related to the Ceph OSD journals.

Severity
--------
Major

Steps to Reproduce
------------------
Install AIO-DX system
Swact to controller-1
Lock controller-0
Issue 'system host-reinstall controller-0'
Unlock controller-0 (the equivalent CLI sequence is sketched below)
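
A rough CLI sequence for the steps above (run as the sysadmin user; host names and the swact direction are taken from the description, not verified against a live system):
controller-0:~$ system host-swact controller-0        # hand the active role to controller-1
controller-1:~$ system host-lock controller-0
controller-1:~$ system host-reinstall controller-0
controller-1:~$ system host-unlock controller-0       # after the reinstall completes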

Expected Behavior
------------------
Controller-0 unlocks and becomes available

Actual Behavior
----------------
Controller-0 enters a failed state

Reproducibility
---------------
100% on AIO-DX

System Configuration
--------------------
Seen on AIO-DX systems. Not seen on 2+2 systems. Have not tested on dedicated storage systems.

Branch/Pull Time/Commit
-----------------------
cengn 20200107T000000Z

Last Pass
---------
Unknown

Timestamp/Logs
--------------
Controller-0 Puppet
2020-01-16T04:32:38.851 ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.0']' returned non-zero exit status 1
Controller-1 sysinv
2020-01-16 04:28:23.981 1131440 WARNING ceph_client ... [{u'outb': u'{"checks":{"OSD_DOWN":{"severity":"HEALTH_WARN","summary":{"message":"1 osds down"},"detail":[{"message":"osd.0 (root=storage-tier,chassis=group-0,host=controller-0) is down"}]

Test Activity
-------------
Developer Testing

Workaround
----------
Lock the node before reinstalling, identify all Ceph OSD disks, wipe the journal partitions by hand on the node you want to reinstall, then reinstall it.
e.g.
controller-0# dd if=/dev/zero of=/dev/sdb2 bs=1M
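
To find the journal partition(s) to wipe, the ceph-disk tooling used on this release can list them; the output below is illustrative, assuming the OSD data and journal share /dev/sdb as in the example above:
controller-0# ceph-disk list
/dev/sdb :
 /dev/sdb1 ceph data, active, cluster ceph, osd.0, journal /dev/sdb2
 /dev/sdb2 ceph journal, for /dev/sdb1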

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - workaround exists

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.storage
Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
status: New → Triaged
tags: added: stx.4.0
Revision history for this message
John Kung (john-kung) wrote :

Issue was observed on host-reinstall following the completion of a platform upgrade:

2020-05-12-18-09-53_controller/puppet.log:2020-05-12T18:12:23.357 Notice: 2020-05-12 18:12:23 +0000 /Stage[main]/Platform::Ceph::Osds/Platform_ceph_osd[stor-2]/Ceph::Osd[/dev/disk/by-path/pci-0000:00:1f.2-ata-2.0]/Exec[ceph-osd-activate-/dev/disk/by-path/pci-0000:00:1f.2-ata-2.0]/returns: ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.1']' returned non-zero exit status 1

Running the failing command manually again on controller-1 shows:
controller-1:/var/log/puppet# /usr/sbin/service ceph --cluster ceph start osd.1
=== osd.1 ===
Mounting xfs on controller-1:/var/lib/ceph/osd/ceph-1
umount: /var/lib/ceph/osd/ceph-1: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
mount: /dev/sdb1 is already mounted or /var/lib/ceph/osd/ceph-1 busy
       /dev/sdb1 is already mounted on /var/lib/ceph/osd/ceph-1
failed: 'modprobe xfs ; egrep -q '^[^ ]+ /var/lib/ceph/osd/ceph-1 ' /proc/mounts && umount /var/lib/ceph/osd/ceph-1 ; mount -t xfs -o rw,noatime,inode64,logbufs=8,logbsize=256k /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1 /var/lib/ceph/osd/ceph-1'
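
To see which processes are holding the OSD mount busy before retrying, fuser can be used (as suggested by the error text above; the osd.1 path is taken from the log):
controller-1# fuser -vm /var/lib/ceph/osd/ceph-1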

Workaround to resolve issue, per Stefan's notes:
"So, to fix the issue I did the following:
- lock the failed standby-controller
- ssh to the failed controller and wipe the journal partition: dd if=/dev/zero of=/dev/sdb2 bs=1M
- do a host-reinstall on the locked controller
- unlock the controller"
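
Pulled together, that recovery sequence looks roughly as follows; run it from the active controller, with <failed-host> and the journal partition (/dev/sdb2 here, per the note above) adjusted to the system:
~$ system host-lock <failed-host>
~$ ssh <failed-host> "dd if=/dev/zero of=/dev/sdb2 bs=1M"
~$ system host-reinstall <failed-host>
~$ system host-unlock <failed-host>     # after the reinstall completes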

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Stefan Dinescu (stefandinescu)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/729599

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/729599
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=4a5845d7b73746690aa3de439061270c8688e764
Submitter: Zuul
Branch: master

commit 4a5845d7b73746690aa3de439061270c8688e764
Author: Stefan Dinescu <email address hidden>
Date: Wed May 20 13:27:56 2020 +0000

    Wipe OSD journals during host reinstall

    A host reinstall (using the system host-reinstall command) fails
    if the OSD journals are not wiped before reinstalling.

    Based on the ceph-manage-journal.py script, just wiping the
    standard 17KB at the beginning and end of the journal partition
    is not enough; instead, about 100MB of data must be wiped.

    Change-Id: I165c385958f7f700cae28312998276aa69ed22c3
    Closes-bug: 1860165
    Signed-off-by: Stefan Dinescu <email address hidden>
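
For reference, wiping roughly 100MB at each end of a journal partition, as described in the commit message, can be done by hand with dd; this is a sketch only (the merged change does the wipe in the reinstall path itself), assuming /dev/sdb2 is the journal partition as in the earlier workaround:
controller-0# dd if=/dev/zero of=/dev/sdb2 bs=1M count=100
controller-0# dd if=/dev/zero of=/dev/sdb2 bs=1M count=100 \
    seek=$(( $(blockdev --getsz /dev/sdb2) / 2048 - 100 ))   # last 100MB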

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Carmen Rata (crata) wrote :

This issue resurfaced when testing an upgrade in an IPv6 lab.

When the upgrade needed to be aborted and a host-downgrade was required, the subsequent host-unlock failed with a Ceph error:

Starting ceph services...
=== osd.1 ===
Mounting xfs on controller-1:/var/lib/ceph/osd/ceph-1
Starting Ceph osd.1 on controller-1...
starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
2020-06-10 04:20:26.297 7f59bc2641c0 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected f5644198-c1f0-46bc-ae86-e09b0487933a, invalid (someone else's?) journal
2020-06-10 04:20:26.298 7f59bc2641c0 -1 filestore(/var/lib/ceph/osd/ceph-1) mount(1871): failed to open journal /var/lib/ceph/osd/ceph-1/journal: (22) Invalid argument
2020-06-10 04:20:26.298 7f59bc2641c0 -1 osd.1 0 OSD:init: unable to mount object store
2020-06-10 04:20:26.298 7f59bc2641c0 -1 ^[[0;31m ** ERROR: osd init failed: (22) Invalid argument^[[0m
failed: 'ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph '
Wed Jun 10 04:20:26 UTC 2020
RC was: 1

Revision history for this message
Frank Miller (sensfan22) wrote :

Re-opening this LP.

Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
Carmen Rata (crata) wrote :

The upgrade was from WRCP 20.04 to 20.06.
It was also reproduced without doing an upgrade abort.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was seen on WCP-78-79 with the designer build when the upgrade was aborted and the hosts were downgraded from 20.04 to 20.06.

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Stefan Dinescu (stefandinescu) → Ovidiu Poncea (ovidiu.poncea)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737292

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The root cause of Carmen's issue in the WCP3-6 lab is an IP collision on the PXE network with another lab. In addition to getting the lab configuration fixed, the code change in https://review.opendev.org/737292 has wipedisk reach the controller over the management network instead of using the PXE network IP. Although this change won't fix the underlying lab configuration issue, it will help avoid data loss in similar cases.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/737292
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=a62dfdd4eabf161b90513c206e4fdf084165cd75
Submitter: Zuul
Branch: master

commit a62dfdd4eabf161b90513c206e4fdf084165cd75
Author: Ovidiu Poncea <email address hidden>
Date: Mon Jun 22 18:22:41 2020 +0300

    Check system upgrade status by querying 'controller' instead of 'pxecontroller'

    Wipedisk is executed from services that communicate over the management
    network, so to avoid possible availability issues on the PXE network, we
    change the interface through which wipedisk checks whether a system is
    upgrading by querying 'controller' instead of 'pxecontroller'.

    Change-Id: I86810fc612723353638e7938081e1ca80f261b13
    Closes-Bug: 1860165
    Signed-off-by: Ovidiu Poncea <email address hidden>
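
For context, 'controller' is the floating hostname on the management network while 'pxecontroller' resolves on the PXE/boot network; which name a node can reach can be checked by hand (illustrative only, not part of the wipedisk change itself):
controller-1:~$ getent hosts controller
controller-1:~$ getent hosts pxecontroller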

Changed in starlingx:
status: In Progress → Fix Released