B&R: Ansible fails to restore Ceph when backup is taken on controller-1

Bug #1899444 reported by Ovidiu Poncea
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Ovidiu Poncea

Bug Description

Brief Description
-----------------
Restore fails with an Ansible error when Ceph is enabled and the backup was taken on controller-1.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
1. Install any setup with Ceph
2. Swact to controller-1
3. Do a backup
4. Copy backup off-site
5. Reinstall controller-0
6. Copy the backup to controller-0
7. Run the Ansible restore playbook on controller-0 (an illustrative command sketch follows this list)
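
For reference, the backup and restore steps above are driven by the StarlingX Ansible playbooks. A minimal command sketch, assuming the usual playbook locations and variable names (these can differ between releases, so treat them as illustrative):

    # On the active controller (controller-1 after the swact): take the backup
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
        -e "ansible_become_pass=<sysadmin password> admin_password=<admin password>"

    # On the reinstalled controller-0, with the backup tarball copied to /home/sysadmin
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -e "initial_backup_dir=/home/sysadmin backup_filename=<backup tarball>.tgz \
            ansible_become_pass=<sysadmin password> admin_password=<admin password>"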

Expected Behavior
------------------
The Ansible restore playbook should complete successfully.

Actual Behavior
----------------
Ansible fails:

fatal: [localhost]: FAILED! => "/etc/init.d/ceph start" returned a non-zero return code (rc=1); start 2020-08-25 02:20:21.690050, end 2020-08-25 02:20:24.630149

stderr:
2020-08-25 02:20:22.160 7f8f8e4de140 -1 mon.controller@-1(probing).mgr e1 Failed to load mgr commands: (2) No such file or directory
2020-08-25 02:20:23.226 7f2475d631c0 -1 journal do_read_entry(318771200): bad header magic
2020-08-25 02:20:23.226 7f2475d631c0 -1 journal do_read_entry(318771200): bad header magic
2020-08-25 02:20:24.022 7f2475d631c0 -1 osd.0 52 log_to_monitors {default=true}
2020-08-25 02:20:24.622 7fb9823451c0 -1 OSD id 0 != my id 1

stdout:
=== mon.controller ===
Starting Ceph mon.controller on controller-0...
=== osd.0 ===
Starting Ceph osd.0 on controller-0...
starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
=== osd.1 ===
Mounting xfs on controller-0:/var/lib/ceph/osd/ceph-1
Starting Ceph osd.1 on controller-0...
failed: 'ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph '

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Test Activity
-------------
Developer Testing

Workaround
----------
Extract the backup archive, remove the [osd.*] sections from etc/ceph/ceph.conf, then repack the archive (see the sketch below).
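
A rough shell sketch of that workaround; the archive name and unpack directory are placeholders, and the exact tar options may need to match the original backup layout:

    # Unpack the platform backup archive
    mkdir backup_tmp
    tar -C backup_tmp -xzf <backup tarball>.tgz

    # Drop every [osd.N] section (header plus its keys) from the bundled ceph.conf
    awk '/^\[/{skip = ($0 ~ /^\[osd\./)} !skip' backup_tmp/etc/ceph/ceph.conf > /tmp/ceph.conf.trimmed
    mv /tmp/ceph.conf.trimmed backup_tmp/etc/ceph/ceph.conf

    # Repack the archive for the restore playbook
    tar -C backup_tmp -czf <backup tarball>.tgz .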

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/757520

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.5.0 stx.update
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/757520
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=943d390dd7d7d1b65f5beabd5c5f28c9221470ff
Submitter: Zuul
Branch: master

commit 943d390dd7d7d1b65f5beabd5c5f28c9221470ff
Author: Ovidiu Poncea <email address hidden>
Date: Wed Oct 7 19:18:38 2020 +0300

    Fix ceph ansible restore failure when backup is taken on controller-1

    Ansible playbook starts ceph through '/etc/init.d/ceph start'.
    This script has two detection mechanisms for osds:
     1. it looks in the ceph.conf for \[osd\.[0-9]*\] sections, grabs
        the osd id, mounts the corresponding devices and starts ceph-osd
        daemons;
     2. it lists all folders in /var/lib/ceph/osd/*, grabs the osd id
        and starts the corresponding daemon.

    When backup is taken on controller-1 it contains /etc/ceph/ceph.conf
    with osds of this node. Restore is always done from controller-0
    where ansible extracts ceph.conf from the backup to
    /etc/ceph/ceph.conf. This leads to osds from controller-1 trying to
    start on controller-0 and to ansible failure.

    To fix it we remove the osds configuration from ceph.conf. This works
    as we have code in the restore playbook that scans the disks for osds
    and mount them in /var/lib/ceph/osd/* allowing 'etc/init.d/ceph start'
    to initialize the correct ceph-osd daemons for controller-0.

    Closes-Bug: 1899444
    Change-Id: I10672613fc26807e0cf28ac8df5a08287d80c17a
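
To make the two OSD detection mechanisms concrete, the following commands illustrate them; this is a sketch, not part of the fix (the actual change is in the review linked above):

    # Mechanism 1: OSD ids declared as [osd.N] sections in ceph.conf; after
    # restoring a controller-1 backup these ids belong to the wrong node
    grep -oE '^\[osd\.[0-9]+\]' /etc/ceph/ceph.conf

    # Mechanism 2: OSD data directories that the restore playbook has already
    # scanned and mounted on controller-0
    ls -d /var/lib/ceph/osd/ceph-* 2>/dev/null

    # The fix strips the [osd.N] sections from the restored ceph.conf, so
    # '/etc/init.d/ceph start' relies on mechanism 2 and starts only the OSDs
    # that actually exist on controller-0.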

Changed in starlingx:
status: In Progress → Fix Released