Initial host-unlock fails on SystemController due to "Failed to make drbd platform primary"

Bug #1902232 reported by John Kung
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: John Kung
Milestone: (none)

Bug Description

Brief Description
-----------------
After ansible bootstrap, the first host-unlock logs "controller_config[9109]: Failed to make drbd-platform primary", and the drbd filesystems fail to come up. This is reproducible on SystemController.

Severity
--------

Major: System is usable after workaround.

Steps to Reproduce
------------------
Install the load, run the ansible bootstrap, and attempt a host-unlock. This is reproducible on SystemController.

Expected Behavior
------------------
The host-unlock should succeed and the drbd filesystems should be enabled, as observable via 'sudo drbd-overview'.

Actual Behavior
----------------
After ansible-playbook and host-unlock, 'sudo drbd-overview' fails to display any resources.

Reproducibility
---------------

Reproducible on SystemController.

System Configuration
--------------------
Two node system, IPv4/IPv6.

Branch/Pull Time/Commit
-----------------------
stx5.0 2020-10-19_00-00-10

Last Pass
---------
Unknown; passing in stx4.0. This is not a new test scenario.

Timestamp/Logs
--------------

There are no error logs of note in the puppet logs or in ansible.log.
The following occurs on host-unlock after ansible-playbook:
sw-patch-controller-daemon...done.
2020-10-20T04:02:42.190 controller-0 drbd[10034]: info adjust disk: drbd-dockerdistribution drbd-etcd drbd-extension drbd-pgsql drbd-platform:failed(apply-al:255) drbd-rabbit
2020-10-20T04:02:42.197 controller-0 drbd[10034]: info adjust net: drbd-dockerdistribution drbd-etcd drbd-extension drbd-pgsql drbd-platform drbd-rabbit
2020-10-20T04:02:42.197 controller-0 drbd[10034]: info ]
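
The failing resource can be picked out of the "adjust disk" line mechanically. The following is a minimal sketch, assuming the log line format shown in the excerpt above; the function name failed_drbd_resources is hypothetical and is not part of StarlingX or DRBD.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print each DRBD resource
# that an "adjust disk" line marks as failed,
# e.g. "drbd-platform:failed(apply-al:255)" yields "drbd-platform".
failed_drbd_resources() {
    grep 'adjust disk:' \
        | grep -oE 'drbd-[a-z]+:failed\([^)]*\)' \
        | sed -E 's/:failed\(.*$//'
}

# Sample line copied from the log excerpt in this report.
sample='2020-10-20T04:02:42.190 controller-0 drbd[10034]: info adjust disk: drbd-dockerdistribution drbd-etcd drbd-extension drbd-pgsql drbd-platform:failed(apply-al:255) drbd-rabbit'

printf '%s\n' "$sample" | failed_drbd_resources   # prints: drbd-platform
```

On a live system the same function could be fed from whichever log file carries the drbd messages; "adjust net" lines and lines without a ":failed(...)" suffix produce no output.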

Test Activity
-------------
Regression Testing

Workaround
----------
Two alternatives:
1) Procedural workaround:
After ansible-playbook bootstrap and before the first host-unlock, run:
systemctl restart drbd.service
OR
2) Code workaround:
Prior to ansible-playbook bootstrap, modify drbd.pp to change lv_size from '1' to '10':
$ diff /usr/share/puppet/modules/platform/manifests/drbd.pp.orig /usr/share/puppet/modules/platform/manifests/drbd.pp
171c171
< $lv_size = '1',
---
> $lv_size = '10',

I.e. set $lv_size to '10' in platform::drbd::platform::params, as per the following:

class platform::drbd::platform::params (
  $device = '/dev/drbd2',
  $lv_name = 'platform-lv',
  $lv_size = '10',
  $mountpoint = '/opt/platform',
  $port = '7790',
  $vg_name = 'cgts-vg',
  $resource_name = 'drbd-platform',
) {}
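
The procedural workaround (alternative 1) can be sketched as a small script. This is an illustration only, assuming it is run as root on the controller after the ansible-playbook bootstrap and before the first host-unlock; the run wrapper, the DRY_RUN variable and the function name are hypothetical.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the commands;
# set DRY_RUN=0 on the controller to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the procedural workaround from this report.
restart_drbd_before_unlock() {
    # Restart drbd so the resources are brought up again.
    run systemctl restart drbd.service
    # Check that the drbd resources are listed before unlocking the host.
    run drbd-overview
}

restart_drbd_before_unlock
```

With the default DRY_RUN=1 the script prints the two commands it would run and changes nothing.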

John Kung (john-kung)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/760551

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

stx.5.0 / high - the DC failure is in the master branch only

Changed in starlingx:
importance: Undecided → High
tags: added: stx.5.0 stx.config stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/760551
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=795b8f617867627a8d82def183d6a680f9bfd5d3
Submitter: Zuul
Branch: master

commit 795b8f617867627a8d82def183d6a680f9bfd5d3
Author: John Kung <email address hidden>
Date: Fri Oct 30 09:20:17 2020 -0400

    Wait for drbdadm prior to resize2fs filesystem

    After ansible bootstrap, on the host-unlock,
    "Failed to make drbd-platform primary" is observed in daemon.log resulting
    in error bringing up drbd filesystems.
    This is reproducible on SystemController.

    Update to perform, drbd-resize on specific resource. After issuing
    the drbdadm resize, a pause is also required prior to performing the
    resize2fs operation.

    There does not appear to be much observability into drbdadm resize
    at /proc/drbd or drbd-overview, so a sleep was introduced. The pause needed
    to be at least 1 second as per observations in virtual and hardware labs,
    AIO and Standard controllers.

    Change-Id: I7b9907092350bf677df30b1d7c54915711791fb2
    Closes-Bug: 1902232
    Signed-off-by: John Kung <email address hidden>
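
The sequence described in the commit message (resize a specific drbd resource, pause, then resize the filesystem) can be sketched as follows. This is not the merged ansible-playbooks change, only an illustration of the ordering: the run wrapper, the DRY_RUN and PAUSE_SECONDS variables and the function name are hypothetical, and the resource and device names are the drbd-platform values from the params class quoted in this report.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
# Pause between the two resize steps. The commit message reports that at
# least 1 second was needed; the value used by the merged change is not
# stated in this report, so 1 here is an assumption.
PAUSE_SECONDS="${PAUSE_SECONDS:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: resize one DRBD resource, wait, then grow its filesystem.
#   $1 = DRBD resource name, $2 = DRBD block device
resize_drbd_filesystem() {
    resource="$1"
    device="$2"
    run drbdadm resize "$resource"
    # drbdadm resize progress is not readily observable, hence a fixed pause.
    run sleep "$PAUSE_SECONDS"
    run resize2fs "$device"
}

resize_drbd_filesystem drbd-platform /dev/drbd2
```

Targeting a single resource rather than all of them mirrors the "drbd-resize on specific resource" wording of the commit; a fixed sleep is used because, per the commit message, /proc/drbd and drbd-overview give little visibility into the resize.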

Changed in starlingx:
status: In Progress → Fix Released