Storage-services loss of redundancy after lock/unlock of standby controller

Bug #1928934 reported by Mihnea Saracin
Affects    Status        Importance  Assigned to      Milestone
StarlingX  Fix Released  Medium      Mihnea Saracin

Bug Description

Brief Description
-----------------

After a lock/unlock of the standby controller, the storage-services service group reports a loss of redundancy: 2 active members are expected, but only 1 active member is available.
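
The symptom can be checked with the same CLI commands captured in the logs below, for example (output trimmed; see the Timestamp/Logs section for the full capture):

    # On a healthy AIO-DX both storage-services members should be active;
    # in the failing case controller-1 shows go-active-warn instead.
    system servicegroup-list | grep storage-services

    # The defect is reported as alarm 400.002 against
    # service_domain=controller.service_group=storage-services.
    fm alarm-list | grep storage-services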

Severity
--------

Major

Steps to Reproduce
------------------

testcases/functional/mtc/test_lock_unlock_host.py::test_lock_unlock_host[controller]

Expected Behavior
------------------

No storage-services loss of redundancy after a lock/unlock of the standby controller.

Actual Behavior
----------------

Storage-services loss of redundancy (alarm 400.002) after a lock/unlock of the standby controller.

Reproducibility
---------------

Seen once
testcases/functional/mtc/test_lock_unlock_host.py::test_lock_unlock_host[controller]

System Configuration
--------------------

AIO-DX

Branch/Pull Time/Commit
-----------------------

stx master built on 2021-04-28_00-00-08

Last Pass
---------

stx master built on 2021-04-24_00-00-09

Timestamp/Logs
--------------

[2021-04-29 07:24:36,633] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...

[2021-04-29 07:24:36,633] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1'
[2021-04-29 07:25:00,842] 436 DEBUG MainThread ssh.expect :: Output:
-----------------------------------------------------------------+

Property Value
-----------------------------------------------------------------+

action none
administrative locked
availability online
bm_ip 128.224.64.71
bm_type dynamic
bm_username root
boot_device /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0
capabilities {u'stor_function': u'monitor'}
clock_synchronization ntp
config_applied 24aada95-871b-405d-a693-ceacc2954a6b
config_status None
config_target 24aada95-871b-405d-a693-ceacc2954a6b
console ttyS0,115200n8
created_at 2021-04-29T05:35:14.179924+00:00
device_image_update None
hostname controller-1
id 2
install_output text
install_state completed
install_state_info None
inv_state inventoried
invprovision provisioned
location {}
mgmt_ip 192.168.204.3
mgmt_mac 90:e2:ba:b8:90:65
operational disabled
personality controller
reboot_needed False
reserved False
rootfs_device /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0
serialid None
software_load 21.05
subfunction_avail online
subfunction_oper disabled
subfunctions controller,worker
task Unlocking
tboot false
ttys_dcd None
updated_at 2021-04-29T07:24:30.689139+00:00
uptime 5021
uuid d37b52b5-00da-45de-b565-9da8523d254e
vim_progress_status services-disabled
-----------------------------------------------------------------+

[2021-04-29 07:35:13,586] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2021-04-29 07:35:13,587] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2021-04-29 07:35:15,506] 436 DEBUG MainThread ssh.expect :: Output:
----------------------------------------------------------------+

id hostname personality administrative operational availability
----------------------------------------------------------------+

1 controller-0 controller unlocked enabled available
2 controller-1 controller unlocked enabled available
----------------------------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2021-04-29 07:35:15,506] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2021-04-29 07:35:15,567] 436 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2021-04-29 07:35:15,568] 344 DEBUG MainThread system_helper.get_hosts:: Filtered hosts: {'availability': ['available']}
[2021-04-29 07:35:16,569] 3939 INFO MainThread system_helper.wait_for_hosts_states:: ['controller-1'] have reached state(s): {'availability': ['available']}
[2021-04-29 07:35:16,570] 578 INFO MainThread host_helper.wait_for_task_clear_and_subfunction_ready:: Waiting for task clear and subfunctions enable/available (if applicable) for hosts: ['controller-1']
[2021-04-29 07:35:16,571] 1657 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_69_70
[2021-04-29 07:35:16,571] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2021-04-29 07:35:16,571] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-1'
[2021-04-29 07:35:18,466] 436 DEBUG MainThread ssh.expect :: Output:
--------------------------------------------------------------------------------------------+

Property Value
--------------------------------------------------------------------------------------------+

action none
administrative unlocked
availability available
bm_ip 128.224.64.71
bm_type dynamic
bm_username root
boot_device /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0
capabilities {u'stor_function': u'monitor', u'Personality': u'Controller-Standby'}
clock_synchronization ntp
config_applied 24aada95-871b-405d-a693-ceacc2954a6b
config_status None
config_target 24aada95-871b-405d-a693-ceacc2954a6b
console ttyS0,115200n8
created_at 2021-04-29T05:35:14.179924+00:00
device_image_update None
hostname controller-1
id 2
install_output text
install_state completed
install_state_info None
inv_state inventoried
invprovision provisioned
location {}
mgmt_ip 192.168.204.3
mgmt_mac 90:e2:ba:b8:90:65
operational enabled
personality controller
reboot_needed False
reserved False
rootfs_device /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0
serialid None
software_load 21.05
subfunction_avail available
subfunction_oper enabled
subfunctions controller,worker
task
tboot false
ttys_dcd None
updated_at 2021-04-29T07:34:55.509782+00:00
uptime 449
uuid d37b52b5-00da-45de-b565-9da8523d254e
vim_progress_status services-enabled
--------------------------------------------------------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2021-04-29 07:35:18,467] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2021-04-29 07:35:18,525] 436 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2021-04-29 07:35:18,526] 592 INFO MainThread host_helper.wait_for_task_clear_and_subfunction_ready:: Hosts task cleared and subfunctions (if applicable) are now in enabled/available states
[2021-04-29 07:35:18,527] 535 INFO MainThread host_helper.wait_for_hosts_ready:: Wait for webservices up for hosts: ['controller-1']
[2021-04-29 07:35:18,527] 1914 INFO MainThread host_helper.wait_for_webservice_up:: Waiting for ['controller-1'] to be active for web-service in system servicegroup-list...
[2021-04-29 07:35:18,528] 1657 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_69_70
[2021-04-29 07:35:18,528] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2021-04-29 07:35:18,528] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne servicegroup-list'
[2021-04-29 07:35:20,992] 436 DEBUG MainThread ssh.expect :: Output:
-------------------------------------------------------------------------------------------+

uuid service_group_name hostname state
-------------------------------------------------------------------------------------------+

7cf5c2df-0006-4a63-bb87-6069ea912a82 cloud-services controller-0 active
41e0922b-57f9-470b-97cd-8c413ef0b2eb cloud-services controller-1 standby
0da0d27a-4510-4e21-8ba0-5024f54b3860 controller-services controller-0 active
5b10f6cb-45b6-4533-ba44-f7553ec949e6 controller-services controller-1 standby
eb3b6103-88d6-471e-a9bb-bd30a1a101c2 directory-services controller-0 active
c73c4acc-6fac-4b18-9e63-f749df79dda7 directory-services controller-1 active
2852293e-ffc3-466d-a3ca-dd4968e23673 oam-services controller-0 active
7b257031-8382-43e7-ac1e-c3d1cba64edd oam-services controller-1 standby
41171fcf-f683-4355-ae19-063afdf8525d patching-services controller-0 active
739c5cc0-6443-4a9c-9096-06156690d8f1 patching-services controller-1 standby
14f476f5-be4a-4090-a529-7ce187d80681 storage-monitoring-services controller-0 active
eaaa21d7-9d83-413c-94ff-fb3b952fac0c storage-monitoring-services controller-1 standby
9f3003ab-5850-4eb5-ae1b-d644713d89a6 storage-services controller-0 active
cefc3b74-1609-4580-851a-6afb60827a9b storage-services controller-1 go-active-warn
21d04677-9fc8-4bd3-afa4-232d5879f8ef vim-services controller-0 active
60063eb6-28a5-497b-9a8a-6057c1d36671 vim-services controller-1 standby
d4c641cb-7fcc-4e6c-a47e-e915298d611b web-services controller-0 active
9daa600f-bcef-4f55-b197-93f10afb8ba5 web-services controller-1 active
-------------------------------------------------------------------------------------------+

[2021-04-29 07:35:27,134] 1657 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_69_70
[2021-04-29 07:35:27,134] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2021-04-29 07:35:27,135] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2021-04-29 07:35:28,995] 436 DEBUG MainThread ssh.expect :: Output:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

UUID Alarm ID Reason Text Entity ID Severity Time Stamp
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

e54195fd-9f15-4c73-9213-80a2c7287215 400.001 Service group storage-services warning; ceph-osd(disabling, failed) service_domain=controller.service_group=storage-services.host=controller-1 minor 2021-04-29T07:35:26.247764
550a64b4-0f7c-4bf2-b6bc-c5a701dc9460 750.004 Application Apply In Progress k8s_application=stx-openstack warning 2021-04-29T07:35:04.438867
082c3b90-7eb9-4ce6-95bc-4508c9f85c87 400.002 Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available service_domain=controller.service_group=storage-services major 2021-04-29T07:23:13.921793
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Test Activity
-------------
Sanity

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792143

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by "Mihnea Saracin <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792143

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/792361

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/792361
Committed: https://opendev.org/starlingx/integ/commit/3225570530458956fd642fa06b83360a7e4e2e61
Submitter: "Zuul (22348)"
Branch: master

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a6033822639926
    Signed-off-by: Mihnea Saracin <email address hidden>
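
For illustration, a minimal sketch of the kind of guard this commit describes is below. It is not the merged code (see the review linked above); the detection of an AIO node via /etc/platform/platform.conf and the use of $0 to tell the script copies apart are assumptions made for the example.

    #!/bin/bash
    # Hypothetical sketch of the guard described in the commit above.
    # On AIO controllers (subfunctions controller,worker) the MTC client
    # runs both the controller and the worker copy of ceph.sh; the worker
    # copy should exit early so ceph services are handled only once.

    # Assumption: /etc/platform/platform.conf lists the node subfunctions,
    # e.g. "subfunction=controller,worker" on an AIO controller.
    if grep -q '^subfunction=.*controller.*worker' /etc/platform/platform.conf; then
        case "$0" in
            */worker/ceph.sh)
                # AIO node: defer to the controller copy of the script.
                exit 0
                ;;
        esac
    fi

    # ... regular ceph service handling follows ...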

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

screening: stx.6.0 / medium - intermittent issue; fixed in the active branch

tags: added: stx.6.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
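
For context, a minimal sketch of the wipe step this commit describes (zeroing the last 10MB of the volume after 'lvextend'). The merged fix performs this inside the Puppet 'logical_volume' provider; the device path and the blockdev/dd invocation here are illustrative assumptions only.

    #!/bin/bash
    # Illustrative only -- not the merged Puppet change.
    # Zeroing the tail of the freshly grown volume prevents stale
    # filesystem signatures from a previous install being mis-detected
    # (e.g. docker-lv reported as drbd instead of xfs).

    DEV="$1"   # e.g. /dev/cgts-vg/docker-lv (hypothetical example path)

    SIZE_BYTES=$(blockdev --getsize64 "$DEV")
    SEEK_MB=$(( SIZE_BYTES / 1024 / 1024 - 10 ))

    # Overwrite the final 10MB of the logical volume with zeros.
    dd if=/dev/zero of="$DEV" bs=1M seek="$SEEK_MB" count=10 conv=fsync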

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

tags: added: in-f-centos8