Unlock fails after restore when trying to resize docker-lv fs

Bug #1926591 reported by Mihnea Saracin
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Mihnea Saracin

Bug Description

Brief Description
-----------------

After the restore playbook was executed, the unlock failed because the docker-lv filesystem resize was not successful.

Severity
--------

Critical

Steps to Reproduce
------------------

1. Install a system

2. Back up the system

3. Run the restore playbook

4. Unlock the controller

Expected Behavior
------------------

Unlock succeeds

Actual Behavior
----------------

Unlock fails

Reproducibility
---------------

100%

System Configuration
--------------------

AIO-SX, AIO-DX

Branch/Pull Time/Commit
-----------------------

stx master build on "2021-04-21"

Last Pass
---------

The 2021-04-15 build worked; builds between 2021-04-15 and 2021-04-20 were not tested.

Timestamp/Logs
--------------

At unlock, Puppet attempted to resize docker-lv from 20G to 30G and failed:

2021-04-22T04:59:09.337 Info: 2021-04-22 04:59:09 +0000 Logical_volume[docker-lv](provider=lvm): Current: value=20.0, unit=G, kibi=20971520
2021-04-22T04:59:09.342 Info: 2021-04-22 04:59:09 +0000 Logical_volume[docker-lv](provider=lvm): New: value=30.0, unit=G, kibi=31457280
2021-04-22T04:59:09.345 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs --noheading -o vg_extent_size --units k /dev/cgts-vg/docker-lv'
2021-04-22T04:59:09.363 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvextend -L 31457280k /dev/cgts-vg/docker-lv'
2021-04-22T04:59:09.417 Debug: 2021-04-22 04:59:09 +0000 Executing: 'umount /dev/cgts-vg/docker-lv'
2021-04-22T04:59:09.425 Debug: 2021-04-22 04:59:09 +0000 Executing: 'fsadm -y check /dev/cgts-vg/docker-lv'
2021-04-22T04:59:09.471 Debug: 2021-04-22 04:59:09 +0000 Executing: 'fsadm -y resize /dev/cgts-vg/docker-lv 31457280k'
2021-04-22T04:59:09.507 Notice: 2021-04-22 04:59:09 +0000 /Stage[main]/Platform::Filesystem::Docker/Platform::Filesystem[docker-lv]/Logical_volume[docker-lv]/size: size changed '20G' to '30G'
2021-04-22T04:59:09.512 Debug: 2021-04-22 04:59:09 +0000 /Stage[main]/Platform::Filesystem::Docker/Platform::Filesystem[docker-lv]/Logical_volume[docker-lv]: The container Platform::Filesystem[docker-lv] will propagate my refresh event
2021-04-22T04:59:09.516 Debug: 2021-04-22 04:59:09 +0000 Class[Platform::Lvm::Vg::Cgts_vg]: The container Stage[main] will propagate my refresh event
2021-04-22T04:59:09.520 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs cgts-vg'
2021-04-22T04:59:09.536 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs --noheading --unit g /dev/cgts-vg/etcd-lv'
2021-04-22T04:59:09.560 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs cgts-vg'
2021-04-22T04:59:09.584 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs --noheading --unit g /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:09.611 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs --noheading --unit g /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:09.632 Info: 2021-04-22 04:59:09 +0000 Logical_volume[kubelet-lv](provider=lvm): Current: value=2.0, unit=G, kibi=2097152
2021-04-22T04:59:09.638 Info: 2021-04-22 04:59:09 +0000 Logical_volume[kubelet-lv](provider=lvm): New: value=10.0, unit=G, kibi=10485760
2021-04-22T04:59:09.642 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvs --noheading -o vg_extent_size --units k /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:09.653 Debug: 2021-04-22 04:59:09 +0000 Executing: '/usr/sbin/lvextend -L 10485760k /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:09.710 Debug: 2021-04-22 04:59:09 +0000 Executing: 'umount /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:10.060 Debug: 2021-04-22 04:59:10 +0000 Executing: 'fsadm -y check /dev/cgts-vg/kubelet-lv'
2021-04-22T04:59:10.166 Debug: 2021-04-22 04:59:10 +0000 Executing: 'fsadm -y resize /dev/cgts-vg/kubelet-lv 10485760k'
2021-04-22T04:59:10.355 Notice: 2021-04-22 04:59:10 +0000 /Stage[main]/Platform::Filesystem::Kubelet/Platform::Filesystem[kubelet-lv]/Logical_volume[kubelet-lv]/size: size changed '2G' to '10G'
2021-04-22T04:59:10.359 Debug: 2021-04-22 04:59:10 +0000 /Stage[main]/Platform::Filesystem::Kubelet/Platform::Filesystem[kubelet-lv]/Logical_volume[kubelet-lv]: The container Platform::Filesystem[kubelet-lv] will propagate my refresh event
2021-04-22T04:59:10.364 Debug: 2021-04-22 04:59:10 +0000 Exec[wipe start of device kubelet-lv](provider=posix): Executing check 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.367 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.371 Debug: 2021-04-22 04:59:10 +0000 Exec[wipe end of device kubelet-lv](provider=posix): Executing check 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.376 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.379 Debug: 2021-04-22 04:59:10 +0000 Exec[mark lv as wiped kubelet-lv:](provider=posix): Executing check 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.382 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.kubelet-lv'
2021-04-22T04:59:10.385 Debug: 2021-04-22 04:59:10 +0000 Exec[wipe start of device docker-lv](provider=posix): Executing check 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.389 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.394 Debug: 2021-04-22 04:59:10 +0000 Executing: '/usr/sbin/lvs cgts-vg'
2021-04-22T04:59:10.415 Debug: 2021-04-22 04:59:10 +0000 Executing: '/usr/sbin/lvs --noheading --unit g /dev/cgts-vg/extension-lv'
2021-04-22T04:59:10.441 Debug: 2021-04-22 04:59:10 +0000 Exec[wipe end of device docker-lv](provider=posix): Executing check 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.445 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.457 Debug: 2021-04-22 04:59:10 +0000 Exec[mark lv as wiped docker-lv:](provider=posix): Executing check 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.466 Debug: 2021-04-22 04:59:10 +0000 Executing: 'test ! -e /etc/platform/.docker-lv'
2021-04-22T04:59:10.470 Debug: 2021-04-22 04:59:10 +0000 Executing: '/usr/sbin/blkid /dev/cgts-vg/docker-lv'
2021-04-22T04:59:10.476 Debug: 2021-04-22 04:59:10 +0000 Executing: 'mkfs.xfs /dev/cgts-vg/docker-lv -n ftype=1'
2021-04-22T04:59:10.485 Error: 2021-04-22 04:59:10 +0000 Execution of 'mkfs.xfs /dev/cgts-vg/docker-lv -n ftype=1' returned 1: mkfs.xfs: /dev/cgts-vg/docker-lv contains a mounted filesystem
2021-04-22T04:59:10.489 Usage: mkfs.xfs
2021-04-22T04:59:10.492 /* blocksize */ [-b log=n|size=num]

Firstly, the resize should not be triggered here at all: in an older load (2021-04-15) docker-lv was already 30G before the unlock.

For some reason the docker-lv filesystem type is reported as drbd, which looks wrong; it should be xfs:

controller-0:/var/log# blkid /dev/cgts-vg/docker-lv
/dev/cgts-vg/docker-lv: UUID="c5c72dc8a5af335b" TYPE="drbd"
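
For reference, the stale signature can be confirmed and cleared by hand with standard tools; this is only a rough workaround sketch, not the shipped fix, and the /var/lib/docker mount point is an assumption:

  umount /var/lib/docker                         # docker-lv is assumed to be mounted here
  blkid /dev/cgts-vg/docker-lv                   # reports TYPE="drbd" instead of xfs
  wipefs --all /dev/cgts-vg/docker-lv            # clear the stale signature left from the previous install
  mkfs.xfs -f -n ftype=1 /dev/cgts-vg/docker-lv  # recreate the expected xfs filesystem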

Possible similar bug: https://bugs.launchpad.net/starlingx/+bug/1883825

Test Activity
-------------

Developer Testing

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Mihnea, is this an issue with the r/stx.5.0 branch as well?

tags: added: stx.update
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/788748
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/255488739efa4ac072424b19f2dbb7a3adb0254e
Submitter: "Zuul (22348)"
Branch: master

commit 255488739efa4ac072424b19f2dbb7a3adb0254e
Author: Mihnea Saracin <email address hidden>
Date: Thu Apr 29 16:37:21 2021 +0300

    Restore host filesystems with collected sizes

    Since https://review.opendev.org/c/starlingx/ansible-playbooks/+/784860,
    the host filesystems (backup, docker, kubelet, scratch) are
    no longer resized in Ansible at restore time and do not use the
    sizes collected in the backup archive. Puppet will try to
    resize them at unlock, but this generates errors.

    The solution is to create the host filesystems with the
    correct sizes at restore. The sizes are taken from the
    backup archive.

    Closes-Bug: 1926591
    Change-Id: Id670408a518e4a1e3fc75a668eea42d26a972d66
    Signed-off-by: Mihnea Saracin <email address hidden>
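
As an illustration only, creating a host filesystem at restore with the size collected from the backup archive boils down to LVM/xfs commands of the following shape (the 30G value and the /var/lib/docker mount point are example assumptions, not taken from the playbook code):

  # Recreate docker-lv with the size recorded in the backup archive (example: 30G)
  lvcreate -n docker-lv -L 30G cgts-vg
  mkfs.xfs -n ftype=1 /dev/cgts-vg/docker-lv
  mount /dev/cgts-vg/docker-lv /var/lib/docker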

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: Marking as stx.5.0 / high priority given that the issue was introduced by recent code changes (see the review referenced in the commit message above), which were cherry-picked to the r/stx.5.0 branch

Changed in starlingx:
importance: Undecided → High
tags: added: stx.5.0 stx.cherrypickneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Mihnea, please cherrypick to the r/stx.5.0 release branch

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.5.0)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/788999
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/490874f7bbd60f0117aa08d5a5fd582670d801b6
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 490874f7bbd60f0117aa08d5a5fd582670d801b6
Author: Mihnea Saracin <email address hidden>
Date: Thu Apr 29 16:37:21 2021 +0300

    Restore host filesystems with collected sizes

    Since https://review.opendev.org/c/starlingx/ansible-playbooks/+/784860,
    the host filesystems (backup, docker, kubelet, scratch) are
    no longer resized in Ansible at restore time and do not use the
    sizes collected in the backup archive. Puppet will try to
    resize them at unlock, but this generates errors.

    The solution is to create the host filesystems with the
    correct sizes at restore. The sizes are taken from the
    backup archive.

    Closes-Bug: 1926591
    Change-Id: Id670408a518e4a1e3fc75a668eea42d26a972d66
    Signed-off-by: Mihnea Saracin <email address hidden>
    (cherry picked from commit 255488739efa4ac072424b19f2dbb7a3adb0254e)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
removed: stx.cherrypickneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/792723

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/792723
Committed: https://opendev.org/starlingx/integ/commit/b310077093fd567944c6a46b7d0adcabe1f2b4b9
Submitter: "Zuul (22348)"
Branch: master

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After a system reinstall there is stale data on the disk,
    and Puppet fails when resizing, reporting wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was already handled in some cases, e.g.
    when doing a live filesystem resize we wipe the last 10MB
    at the end of the partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    A resize can happen at unlock when a bigger size is detected for the
    filesystem, and 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
          size: 50
          type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
          size: 30
          type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
          type: disk

    B&R on a DX system with a backup filesystem of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>
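
Expressed as plain shell commands, the idea behind this fix is roughly the following; the real change lives in the Ruby 'logical_volume' Puppet provider, and the device/size values here are taken from the unlock log above:

  lvextend -L 31457280k /dev/cgts-vg/docker-lv
  # Wipe the last 10MB of the now-larger LV so stale signatures from a previous
  # install (e.g. the drbd metadata seen above) are not misdetected by blkid/fsadm.
  lv_bytes=$(blockdev --getsize64 /dev/cgts-vg/docker-lv)
  dd if=/dev/zero of=/dev/cgts-vg/docker-lv bs=1M count=10 seek=$(( lv_bytes / 1048576 - 10 )) oflag=direct
  fsadm -y resize /dev/cgts-vg/docker-lv 31457280k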

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in Puppet, and this fix in Ansible was not good enough; it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0, which was introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 1928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibility that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller,
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}.

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After a system reinstall there is stale data on the disk,
    and Puppet fails when resizing, reporting wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was already handled in some cases, e.g.
    when doing a live filesystem resize we wipe the last 10MB
    at the end of the partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    A resize can happen at unlock when a bigger size is detected for the
    filesystem, and 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
          size: 50
          type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
          size: 30
          type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
          type: disk

    B&R on a DX system with a backup filesystem of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
