standby controller was in reboot loop after modifying the partition size

Bug #1809009 reported by Anujeyan Manokeran
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Wei Zhou

Bug Description

During regression testing, the standby controller went into a reboot loop after the partition size for cinder-volumes was modified, the controllers were swacted, and the active controller was rebooted.

2018-12-10T16:17:34.886 Notice: 2018-12-10 16:17:34 +0000 /Stage[main]/Platform::Partitions/Platform_manage_partition[modify]/Exec[manage-partitions-modify]/returns: Called partition 'modify' with 'None' 'None' and '[{"start_mib": 1, "current_uuid": "2a905d0e-aef1-4408-ba13-467ca8e65b8d", "part_device_path": "/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:12:0-part1", "new_size_mib": 87040, "req_guid": "ba5eba11-0000-1111-2222-000000000001", "ihost_uuid": "967c7c16-badd-4a1a-aaf7-50cc61db1919"}]'
2018-12-10T16:17:34.888 Notice: 2018-12-10 16:17:34 +0000 /Stage[main]/Platform::Partitions/Platform_manage_partition[modify]/Exec[manage-partitions-modify]/returns: unable to open database file
2018-12-10T16:17:34.891 Error: 2018-12-10 16:17:34 +0000 /bin/true # puppet requires this for correct template parsing
2018-12-10T16:17:34.893
2018-12-10T16:17:34.896 sm-unmanage service drbd-cinder
2018-12-10T16:17:34.898
2018-12-10T16:17:34.901
2018-12-10T16:17:34.904 DRBD_UNCONFIGURED_TIMEOUT=180
2018-12-10T16:17:34.906 DRBD_UNCONFIGURED_DELAY=0
2018-12-10T16:17:34.909 while [[ $DRBD_UNCONFIGURED_DELAY -lt $DRBD_UNCONFIGURED_TIMEOUT ]]; do
2018-12-10T16:17:34.912 drbdadm down drbd-cinder
2018-12-10T16:17:34.914 drbd_info=$(drbd-overview | grep drbd-cinder | awk '{print $2}')
2018-12-10T16:17:34.917
2018-12-10T16:17:34.919 if [[ ${drbd_info} == "Unconfigured" ]]; then
2018-12-10T16:17:34.921 break
2018-12-10T16:17:34.924 else
2018-12-10T16:17:34.926 sleep 2
2018-12-10T16:17:34.929 DRBD_UNCONFIGURED_DELAY=$((DRBD_UNCONFIGURED_DELAY + 2))
2018-12-10T16:17:34.931 fi
2018-12-10T16:17:34.933 done
2018-12-10T16:17:34.936
2018-12-10T16:17:34.938 if [[ DRBD_UNCONFIGURED_DELAY -eq DRBD_UNCONFIGURED_TIMEOUT ]]; then
2018-12-10T16:17:34.940 exit 40
2018-12-10T16:17:34.943 fi
2018-12-10T16:17:34.945
2018-12-10T16:17:34.948
2018-12-10T16:17:34.950 manage-partitions modify '[{"start_mib": 1, "current_uuid": "2a905d0e-aef1-4408-ba13-467ca8e65b8d", "part_device_path": "/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:12:0-part1", "new_size_mib": 87040, "req_guid": "ba5eba11-0000-1111-2222-000000000001", "ihost_uuid": "967c7c16-badd-4a1a-aaf7-50cc61db1919"}]'
2018-12-10T16:17:34.952
2018-12-10T16:17:34.955 drbdadm up drbd-cinder || exit 30
2018-12-10T16:17:34.957
2018-12-10T16:17:34.959
2018-12-10T16:17:34.962 sm-manage service drbd-cinder
2018-12-10T16:17:34.964 returned 255 instead of one of [0]
Severity
--------
Major

Steps to Reproduce
------------------
1. Modify partition size for cinder-volumes for standby controller
2. Swact controllers
3. Modify partition size for cinder-volumes for the new standby controller
4. Make sure cinder-volumes lvg is resized successfully and no alarms are generated
5. Swact controllers again
6. Reboot active controller (Run “sudo reboot” on the active controller)
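
The steps above might be scripted roughly as follows. This is a dry-run sketch only: it prints the commands instead of running them. The `system host-disk-partition-modify` and `system host-swact` invocations are assumed to match the StarlingX CLI on the load under test, and the host names, partition path, and size are placeholders, not values taken from this report.

```shell
#!/bin/bash
# Dry-run sketch of the reproduction steps: prints commands, runs nothing.
# PART_PATH and NEW_SIZE_GIB are placeholders (assumptions, not from the report).
PART_PATH="<cinder-volumes partition path or uuid>"
NEW_SIZE_GIB="<new size in GiB>"

# Print the command instead of executing it.
run() { echo "+ $*"; }

repro_steps() {
    local active=$1 standby=$2
    # Step 1: resize the partition on the standby controller.
    run system host-disk-partition-modify -s "$NEW_SIZE_GIB" "$standby" "$PART_PATH"
    # Step 2: swact, so the former active controller becomes standby.
    run system host-swact "$active"
    # Step 3: resize the partition on the new standby controller.
    run system host-disk-partition-modify -s "$NEW_SIZE_GIB" "$active" "$PART_PATH"
    # Step 4 (manual): confirm the cinder-volumes lvg resized and no alarms are raised.
    # Step 5: swact again, back to the original active controller.
    run system host-swact "$standby"
    # Step 6: reboot, issued on the controller that is now active.
    run sudo reboot
}

repro_steps controller-0 controller-1
```

Step 4 is left as a comment because the report does not say which commands were used to verify the resize.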

Expected Behavior
------------------
No reboot loop on the standby controller

Actual Behavior
----------------
As per description

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
duplex system

Branch/Pull Time/Commit
-----------------------

Timestamp/Logs
--------------
2018-10-26T15:01:03.000
------------------- Titanium-specific Information -------------------
* Test case mapping: Regression

* Test-case Title: Verify cloud patch orchestration from horizon
* Lab Name: Distributed cloud lab main cloud is WCP90-91

* Load Information:

Build Server: yow-cgts1-lx
SW_VERSION="18.10"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2018-10-24_20-18-00"
SRC_BUILD_ID="116"

JOB="StarlingX_Upstream_build"
BUILD_BY="jenkins"
BUILD_NUMBER="116"
BUILD_HOST="yow-cgts1-lx"

* Logs location: /folk/cgts_logs/logs/CGTS-10402

* When and where was the last time this test-case passed?..

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/625949
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=0d958de077251936a945dc71546ee53f924ad345
Submitter: Zuul
Branch: master

commit 0d958de077251936a945dc71546ee53f924ad345
Author: Wei Zhou <email address hidden>
Date: Tue Dec 18 11:17:36 2018 -0500

    Standby controller in reboot loop after modifying the partition size

    The root cause of this issue is that when the newly standby controller
    is coming up, the partition manifest calls "sm-unmanage / sm-manage"
    for drbd-cinder service. However at this time /var/run/sm.db is not
    available yet which causes "sm-unmanage / sm-manage" to fail.

    This commit adds a check in the partition puppet file that only
    if the system is up "sm-unmanage" and "sm-manage" will be called.

    Change-Id: I1869f024579350265684f2ee3bb5e3e74e6427cb
    Closes-bug: 1809009
    Signed-off-by: Wei Zhou <email address hidden>
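
The guard described in the commit message can be sketched in shell as below. This is an illustration, not the merged change, which lives in the partition puppet file. The helper names `sm_ready` and `sm_guarded` and the `SM_DB` variable are hypothetical, and treating the presence of /var/run/sm.db as the "system is up" signal is an assumption drawn from the root-cause description.

```shell
#!/bin/bash
# Sketch: only call sm-unmanage / sm-manage when the SM database exists.
# SM_DB, sm_ready and sm_guarded are hypothetical names for illustration.
SM_DB="${SM_DB:-/var/run/sm.db}"

# Succeeds when the SM database file is present.
sm_ready() {
    [ -e "$SM_DB" ]
}

# Run the given command only when SM is ready; otherwise skip it and
# return success so the surrounding step does not fail.
sm_guarded() {
    if sm_ready; then
        "$@"
    else
        echo "skipping '$*': $SM_DB not available" >&2
        return 0
    fi
}

# Usage, mirroring the sequence in the log above:
#   sm_guarded sm-unmanage service drbd-cinder
#   ... drbdadm down / manage-partitions modify / drbdadm up ...
#   sm_guarded sm-manage service drbd-cinder
```

Returning success when the database is missing keeps the partition resize itself from being blocked while the controller is still booting, which is the situation the commit message describes.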

Changed in starlingx:
status: New → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Wei Zhou (wzhou007)
importance: Undecided → Medium
tags: added: stx.2019.03 stx.config
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05