simplex: host did not become active controller after initial unlock

Bug #1883825 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Elena Taivan

Bug Description

Brief Description
-----------------
One simplex system did not become the active controller after the initial unlock.

controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

Severity
--------
Major

Steps to Reproduce
------------------
- Install simplex
- Bootstrap
- Configure and unlock the node

Expected Behavior
------------------
- Host unlocked successfully and became active controller

Actual Behavior
----------------
- Host did not become active controller

controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ sudo sm-dump
Password:
/var/run/sm/sm.db not available.

Reproducibility
---------------
Reproducible on this system (wcp122). Not seen on other simplex systems (wcp112, ml350-g10-1).

System Configuration
--------------------
One node system
Lab-name: wcp122

Branch/Pull Time/Commit
-----------------------
2020-06-15_20-00-00

Last Pass
---------
2020-06-10_20-00-00

Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1883825

Some errors from the puppet log, though I'm not sure they are related.

2020-06-16T23:52:09.503 Warning: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Ceph::Post/File[/etc/platform/.node_ceph_configured]: Skipping because of failed dependencies
2020-06-16T23:52:09.508 Notice: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Controller::Post/File[/var/run/.controller_config_complete]: Dependency Filesystem[/dev/cgts-vg/backup-lv] has failures: true
2020-06-16T23:52:09.510 Warning: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Controller::Post/File[/var/run/.controller_config_complete]: Skipping because of failed dependencies
2020-06-16T23:52:09.513 Notice: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Controller::Post/File[/etc/platform/.initial_config_complete]: Dependency Filesystem[/dev/cgts-vg/backup-lv] has failures: true
2020-06-16T23:52:09.515 Warning: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Controller::Post/File[/etc/platform/.initial_config_complete]: Skipping because of failed dependencies
2020-06-16T23:52:09.519 Notice: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Post/File[/etc/platform/.config_applied]: Dependency Filesystem[/dev/cgts-vg/backup-lv] has failures: true
2020-06-16T23:52:09.520 Warning: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Post/File[/etc/platform/.config_applied]: Skipping because of failed dependencies
2020-06-16T23:52:09.526 Notice: 2020-06-16 23:52:09 +0000 /Stage[post]/Platform::Config::Controller::Post/File[/etc/platform/.initial_controller_config_complete]: Dependency Filesystem[/dev/cgts-vg/backup-lv] has failures: true

Test Activity
-------------
Sanity

Yang Liu (yliu12)
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.4.0 stx.storage
Changed in starlingx:
importance: Undecided → High
importance: High → Medium
status: New → Triaged
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The error is this:

2020-06-18T05:40:05.471 Info: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/osd_pool_default_size]: Scheduling refresh of Service[ceph-mon-controller-0]
2020-06-18T05:40:05.473 Debug: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/osd_pool_default_size]: The container Class[Ceph] will propagate my refresh event
2020-06-18T05:40:05.475 Debug: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/osd_crush_update_on_start]: Nothing to manage: no ensure and the resource doesn't exist
2020-06-18T05:40:05.477 Debug: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/cluster_require_signatures]: Nothing to manage: no ensure and the resource doesn't exist
2020-06-18T05:40:05.479 Notice: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/ms_bind_ipv6]/ensure: created
2020-06-18T05:40:05.481 Info: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/ms_bind_ipv6]: Scheduling refresh of Service[ceph-mon-controller-0]
2020-06-18T05:40:05.483 Debug: 2020-06-18 05:40:05 +0000 /Stage[main]/Ceph/Ceph_config[global/ms_bind_ipv6]: The container Class[Ceph] will propagate my refresh event
2020-06-18T05:40:05.485 Debug: 2020-06-18 05:40:05 +0000 Executing: '/usr/sbin/lvs cgts-vg'
2020-06-18T05:40:05.487 Debug: 2020-06-18 05:40:05 +0000 Executing: '/usr/sbin/lvs --noheading --unit g /dev/cgts-vg/backup-lv'
2020-06-18T05:40:05.489 Debug: 2020-06-18 05:40:05 +0000 Executing: '/usr/sbin/blkid /dev/cgts-vg/backup-lv'
2020-06-18T05:40:05.491 Debug: 2020-06-18 05:40:05 +0000 Executing: 'mkfs.ext4 /dev/cgts-vg/backup-lv'
2020-06-18T05:40:05.493 Error: 2020-06-18 05:40:05 +0000 Execution of 'mkfs.ext4 /dev/cgts-vg/backup-lv' returned 1: mke2fs 1.42.9 (28-Dec-2013)
2020-06-18T05:40:05.495 /dev/cgts-vg/backup-lv is mounted; will not make a filesystem here!

The two main questions we need answers to are: Why would Ceph trigger a refresh event on backup-lv? And, even if it does so, why is it not idempotent?

Looking over the manifests executed:
controller-0:/home/sysadmin# ls -latr /var/log/puppet/latest
lrwxrwxrwx 1 root root 46 Jun 18 05:38 /var/log/puppet/latest -> /var/log/puppet/2020-06-18-05-38-46_controller
controller-0:/home/sysadmin# ls -latr /var/log/puppet
total 272
drwxr-xr-x 2 root root 4096 Jun 18 05:15 2020-06-18-05-15-08_controller
-rw-r--r-- 1 root root 231555 Jun 18 05:17 first_apply.tgz
drwxrwxrwx 2 root root 4096 Jun 18 05:18 2020-06-18-05-18-22_controller
drwxrwxrwx 2 root root 4096 Jun 18 05:20 2020-06-18-05-20-18_worker
drwxrwxrwx 2 root root 4096 Jun 18 05:33 2020-06-18-05-33-07_worker
drwxrwxrwx 2 root root 4096 Jun 18 05:33 2020-06-18-05-33-29_worker
drwxrwxrwx 2 root root 4096 Jun 18 05:33 2020-06-18-05-33-49_worker
drwxrwxrwx 2 root root 4096 Jun 18 05:34 2020-06-18-05-34-03_worker
drwxrwxrwx 2 root root 4096 Jun 18 05:34 2020-06-18-05-34-11_controller
lrwxrwxrwx 1 root root 46 Jun 18 05:38 latest -> /var/log/puppet/2020-06-18-...


Frank Miller (sensfan22)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Elena Taivan (etaivan)
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

# Workaround:
Wipe backup-lv with: dd if=/dev/zero of=/dev/cgts-vg/backup-lv bs=1M
then reinstall & configure the node.
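
To confirm the wipe actually removed the stale filesystem signature before reinstalling, blkid can be checked (this verification step is my addition, not part of the reported workaround):

    # blkid prints nothing and exits non-zero when no signature remains on the LV
    sudo blkid /dev/cgts-vg/backup-lv || echo "no filesystem signature found"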

# Why does the issue happen?

There is stale data on the volume that leads to the failure of the lvm/filesystem puppet resource.

Code that fails is:
./cgcs-root/stx/stx-puppet/puppet-manifests/src/modules/platform/manifests/filesystem.pp
    # create filesystem
    -> filesystem { $device:
      ensure => $ensure,
      fs_type => $fs_type,
      options => $fs_options,
    }

Looking over the puppet component code in https://github.com/puppetlabs/puppetlabs-lvm/blob/master/lib/puppet/provider/filesystem/lvm.rb, we see that it uses 'blkid' to check whether the filesystem described by 'ensure => $ensure' already exists and, if the partition is not of the desired $fs_type, it runs 'mkfs'.

Without this issue, the 'filesystem' resource shouldn't do anything here.

Since, from the logs above, mkfs was executed even though backup-lv was already formatted as ext4, the assumption is that blkid did not correctly detect the filesystem type, even though the Linux kernel did.
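
In shell terms, the check the provider effectively performs is roughly the following (a sketch only; the actual provider code is Ruby, in the lvm.rb file linked above):

    # run as root, as puppet does
    DEV=/dev/cgts-vg/backup-lv
    # blkid reports the filesystem signature it finds on the device, if any
    if [ "$(blkid -o value -s TYPE "$DEV")" != "ext4" ]; then
        mkfs.ext4 "$DEV"    # this is the call that failed: the LV was already mounted
    fi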

Later, after wiping backup-lv, execution no longer failed, which leads us to conclude that the problem is caused by stale data on the disk. This is reinforced by similar issues in the past with partitions and partition resizes, which led us to wipe them at the beginning and at the end.

# Solution to avoid cases like this:
Wipe 10MB at the beginning and 10MB at the end of each LV in cgts-vg immediately after we create them. Creation happens in Ansible.
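
For illustration, a rough shell sketch of such a wipe on a single LV (not the code from the actual change; backup-lv is just the example device):

    LV=/dev/cgts-vg/backup-lv                                        # example LV; the fix covers every LV in cgts-vg
    SIZE_MB=$(( $(blockdev --getsize64 "$LV") / 1024 / 1024 ))       # LV size in MB
    dd if=/dev/zero of="$LV" bs=1M count=10                          # wipe the first 10MB
    dd if=/dev/zero of="$LV" bs=1M count=10 seek=$(( SIZE_MB - 10 )) # wipe the last 10MB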

Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739410

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739411

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/739410
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=ec40a9cdf64f4013aa3e6357b94a83680a15fa8d
Submitter: Zuul
Branch: master

commit ec40a9cdf64f4013aa3e6357b94a83680a15fa8d
Author: Elena Taivan <email address hidden>
Date: Mon Jul 6 06:15:21 2020 +0000

    Cleanup stale data on the logical volume

    Wipe 10MB at the beginning and at the end of each LV in cgts-vg
    immediately after we create and extend them to cleanup the stale
    data that could lead to the failure of lvm/filesystem
    puppet resource.

    Closes-Bug: 1883825
    Change-Id: I727b1cb02d74a4208a6c9593819b8513c1071fea
    Signed-off-by: Elena Taivan <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/739411
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9647e641a36b3060b0d6ded0b3b78e579a18de80
Submitter: Zuul
Branch: master

commit 9647e641a36b3060b0d6ded0b3b78e579a18de80
Author: Elena Taivan <email address hidden>
Date: Mon Jul 6 06:25:11 2020 +0000

    Cleanup stale data on the logical volume

    Wipe 10MB at the end of each LV in cgts-vg
    immediately after we extend them to cleanup the stale
    data that could lead to the failure of lvm/filesystem
    puppet resource.

    Change-Id: I4bbeac0813166a47d837cb0f544af3582d9f955f
    Partial-bug: 1883825
    Signed-off-by: Elena Taivan <email address hidden>

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Elena, since this LP is tagged for stx.4.0, please cherrypick the fix to the r/stx.4.0 branch asap. thx.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (r/stx.4.0)

Fix proposed to branch: r/stx.4.0
Review: https://review.opendev.org/741392

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.4.0)

Fix proposed to branch: r/stx.4.0
Review: https://review.opendev.org/741393

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (r/stx.4.0)

Reviewed: https://review.opendev.org/741393
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=59e1c800fd3d49d4f1039322a1c50a779aee0602
Submitter: Zuul
Branch: r/stx.4.0

commit 59e1c800fd3d49d4f1039322a1c50a779aee0602
Author: Elena Taivan <email address hidden>
Date: Mon Jul 6 06:25:11 2020 +0000

    Cleanup stale data on the logical volume

    Wipe 10MB at the end of each LV in cgts-vg
    immediately after we extend them to cleanup the stale
    data that could lead to the failure of lvm/filesystem
    puppet resource.

    Change-Id: I4bbeac0813166a47d837cb0f544af3582d9f955f
    Partial-bug: 1883825
    Signed-off-by: Elena Taivan <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (r/stx.4.0)

Reviewed: https://review.opendev.org/741392
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=01694b78e577ccef5be467a8d0fd450c80806a52
Submitter: Zuul
Branch: r/stx.4.0

commit 01694b78e577ccef5be467a8d0fd450c80806a52
Author: Elena Taivan <email address hidden>
Date: Mon Jul 6 06:15:21 2020 +0000

    Cleanup stale data on the logical volume

    Wipe 10MB at the beginning and at the end of each LV in cgts-vg
    immediately after we create and extend them to cleanup the stale
    data that could lead to the failure of lvm/filesystem
    puppet resource.

    Closes-Bug: 1883825
    Change-Id: I727b1cb02d74a4208a6c9593819b8513c1071fea
    Signed-off-by: Elena Taivan <email address hidden>

Ghada Khalil (gkhalil)
tags: added: in-r-stx40
Revision history for this message
Yang Liu (yliu12) wrote :

This issue is no longer seen on wcp122 with the above fix.

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
