Backup & Restore: System Backup fails without Ceph backend configuration

Bug #1873974 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
System backup fails when no Ceph storage backend is configured on the system.

E TASK [backup/backup-system : Name ceph crushmap backup] **************************************************************************************************************************************************************************************************
E ok: [localhost]
E
E TASK [backup/backup-system : Create ceph crushmap backup] ************************************************************************************************************************************************************************************************
E fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["ceph", "osd", "getcrushmap", "-o", "/opt/backups/ansible.CmNwfv/ceph/crushmap.bin.backup"], "delta": "0:00:00.137620", "end": "2020-04-20 22:07:31.316105", "msg": "non-zero return code", "rc": 1, "start": "2020-04-20 22:07:31.178485", "stderr": "unable to get monitor info from DNS SRV with service name: ceph-mon\nno monitors specified to connect to.\n2020-04-20 22:07:31.293 7f4b0e024700 -1 failed for service _ceph-mon._tcp\n2020-04-20 22:07:31.295 7f4b0e024700 -1 monclient: get_monmap_and_config cannot identify monitors to contact\n[errno 2] RADOS Object Not Found error (error connecting to the cluster)", "stderr_lines": ["unable to get monitor info from DNS SRV with service name: ceph-mon", "no monitors specified to connect to.", "2020-04-20 22:07:31.293 7f4b0e024700 -1 failed for service _ceph-mon._tcp", "2020-04-20 22:07:31.295 7f4b0e024700 -1 monclient: get_monmap_and_config cannot identify monitors to contact", "[errno 2] RADOS Object Not Found error (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
E
E TASK [backup/backup-system : Remove the temp dir]

Severity
--------
Major: system backup fails

Steps to Reproduce
------------------
1. Boot AIO-SX system
2. Run the backup playbook from the active controller:
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=Li69nux* admin_password=Li69nux*"

Expected Behavior
------------------
Backup of a system with no Ceph backend configured should succeed.

Actual Behavior
----------------
Backup fails on a system with no Ceph backend configured.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX (SM-3)

Branch/Pull Time/Commit
-----------------------
2020-04-19_20-00-00

Timestamp/Logs
--------------

Test Activity
-------------
Regression Testing

tags: added: stx.retestneeded
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

There is also an issue at restore: we should always restore with wipe_ceph_osds: true, and a test should be attempted.

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721986

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue with Backup & Restore (B&R), which is an stx.4.0 deliverable

tags: added: stx.4.0 stx.update
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/721814
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=dd89ba118d21027da28f860f2da47e6794d0453b
Submitter: Zuul
Branch: master

commit dd89ba118d21027da28f860f2da47e6794d0453b
Author: Dan Voiculeasa <email address hidden>
Date: Wed Apr 22 13:32:21 2020 +0300

    Fix backup without ceph backend

    When ceph backend is not configured there is no ceph crushmap to be
    backed up. Skip the crushmap backup step.

    Partial-Bug: 1873974
    Change-Id: Ic2b7a77f4a54d3d30aedd6c00747fc4586428997
    Signed-off-by: Dan Voiculeasa <email address hidden>
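For illustration, the fix amounts to guarding the crushmap tasks on the presence of a Ceph backend. A minimal sketch of the idea, assuming a hypothetical ceph_backend_configured fact (the merged change derives this from the system's storage backend configuration) and a registered temp directory variable:

# Only attempt the crushmap export when a ceph storage backend exists; otherwise
# the ceph CLI cannot reach any monitor and the task fails as shown in the log above.
- name: Create ceph crushmap backup
  command: >
    ceph osd getcrushmap -o {{ tempdir.path }}/ceph/crushmap.bin.backup
  when: ceph_backend_configured | bool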

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/721986
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=54e9b94773f3ae9c6be7eb14e141537cad373915
Submitter: Zuul
Branch: master

commit 54e9b94773f3ae9c6be7eb14e141537cad373915
Author: Dan Voiculeasa <email address hidden>
Date: Wed Apr 22 15:44:15 2020 +0300

    Fix restore without ceph backend

    When ceph backend is not configured there is no ceph crushmap to be
    restored, nor ceph monitors data. Skip restoring those.

    The rest of the logic regarding ceph osds can be treated as if osds were
    wiped.

    Closes-Bug: 1873974
    Depends-On: Ic2b7a77f4a54d3d30aedd6c00747fc4586428997
    Change-Id: I2776d7c2d5801ce6e81c487da263075b6f6873c8
    Signed-off-by: Dan Voiculeasa <email address hidden>
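On the restore side the same guard lets the playbook skip the crushmap and monitor restore and fall into the wiped-OSD path. A rough sketch under the same assumption (hypothetical ceph_backend_configured fact and staging directory variable; not the merged change verbatim):

- name: Restore ceph crushmap
  command: >
    ceph osd setcrushmap -i {{ staging_dir }}/ceph/crushmap.bin.backup
  when: ceph_backend_configured | bool

# With no ceph backend, follow exactly the same code path as a restore with wiped OSDs.
- name: Treat OSDs as wiped when ceph is not configured
  set_fact:
    wipe_ceph_osds: true
  when: not ceph_backend_configured | bool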

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task: 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>
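The change described above amounts to dropping the ceph condition from the etcd restore step. A purely illustrative sketch (task and included file names are hypothetical):

- name: Restore etcd database from the backup snapshot
  include_tasks: restore-etcd.yml
  # Before the fix this was gated on keeping ceph data, e.g.:
  #   when: not wipe_ceph_osds | bool
  # After the fix, etcd is restored regardless of the wipe_ceph_osds setting.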

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updated images and pushing them to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>
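Calico exposes its failsafe ports through the FelixConfiguration resource; adding the kube-apiserver port there keeps host-to-apiserver traffic open even if a bad policy is applied. The snippet below is illustrative only: it assumes the default apiserver port 6443, and note that setting failsafeInboundHostPorts replaces Calico's built-in defaults, which would normally be re-listed alongside the new entry.

apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  failsafeInboundHostPorts:
    # kube-apiserver: always reachable by calico-node, regardless of policy
    - protocol: TCP
      port: 6443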

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up a new pod to a Running state.

    Change-Id: I83c43c52a77...
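In Kubernetes Deployment terms, "one additional pod and one unavailable pod during an update" corresponds to a RollingUpdate strategy with maxSurge: 1 and maxUnavailable: 1. A hedged sketch of what the patched tiller-deploy Deployment would carry (namespace and surrounding fields are assumed, not taken from the change):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tiller-deploy
  namespace: kube-system
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow one new pod alongside the current one
      maxUnavailable: 1  # allow the current pod to be deleted, freeing its host ports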

tags: added: in-f-centos8
Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

Verified on sm-1 using the 2020-06-10_20-00-00 load.

tags: removed: stx.retestneeded