Worker node reporting failed after host re-install post ceph-mon fs resizing

Bug #1828262 reported by Maria Yousaf on 2019-05-08
This bug affects 2 people
Affects: StarlingX
Importance: Medium
Assigned to: Tingjie Chen

Bug Description

Brief Description
-----------------
I performed a host reinstall of a worker node and, after unlock, the worker node came up as failed. The node had the following changes performed on it prior to the reinstall: it was added as a ceph monitor and the ceph-mon filesystem size was increased from 20 GiB to 40 GiB (although, due to a known issue, the filesystem change did not take effect and the actual size remained at 20 GiB).

Severity
--------
Major

Steps to Reproduce
------------------
1. Add node (worker-1) as a ceph monitor after deleting host worker-0.
2. Increase the size of the ceph-mon filesystem from 20 GiB to 40 GiB.
3. Lock the host and perform a host reinstall.
4. Once the host comes online, unlock it.
5. The host comes up in a failed state (see the command sketch below).
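
For reference, a rough sketch of the CLI commands behind these steps (hedged: subcommand names such as ceph-mon-add and host-reinstall are assumed from the StarlingX system client of that era and may differ by release):

$ system host-delete worker-0                          # remove the old monitor host
$ system ceph-mon-add worker-1                         # add worker-1 as a ceph monitor (assumed subcommand)
$ system ceph-mon-modify controller-0 ceph_mon_gib=40  # grow the ceph-mon fs from 20 to 40 GiB (applies to all monitors)
$ system host-lock worker-1
$ system host-reinstall worker-1                       # assumed subcommand for the re-install
# wait for the host to come back online, then:
$ system host-unlock worker-1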

Looking at the puppet logs, I see the following reported:

2019-05-08T16:15:18.867 Error: 2019-05-08 16:15:18 +0000 Execution of '/usr/sbin/lvcreate -n docker-lv --size 31457280k cgts-vg' returned 5: Volume group "cgts-vg" has insufficient free space (357 extents): 960 required.
2019-05-08T16:15:18.870 /usr/share/ruby/vendor_ruby/puppet/util/execution.rb:232:in `execute'
2019-05-08T16:15:18.874 /usr/share/ruby/vendor_ruby/puppet/provider/command.rb:23:in `execute'
2019-05-08T16:15:18.878 /usr/share/ruby/vendor_ruby/puppet/provider.rb:221:in `block in has_command'
2019-05-08T16:15:18.881 /usr/share/ruby/vendor_ruby/puppet/provider.rb:422:in `block in create_class_and_instance_method'
2019-05-08T16:15:18.884 /usr/share/puppet/modules/lvm/lib/puppet/provider/logical_volume/lvm.rb:97:in `create'
2019-05-08T16:15:18.887 /usr/share/ruby/vendor_ruby/puppet/property/ensure.rb:16:in `block in defaultvalues'
2019-05-08T16:15:18.890 /usr/share/ruby/vendor_ruby/puppet/property.rb:487:in `set'
2019-05-08T16:15:18.895 /usr/share/ruby/vendor_ruby/puppet/property.rb:561:in `sync'
2019-05-08T16:15:18.897 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:236:in `sync'
2019-05-08T16:15:18.901 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:134:in `sync_if_needed'
2019-05-08T16:15:18.904 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:80:in `perform_changes'
2019-05-08T16:15:18.908 /usr/share/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:21:in `evaluate'
2019-05-08T16:15:18.911 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:230:in `apply'
2019-05-08T16:15:18.916 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:246:in `eval_resource'
2019-05-08T16:15:18.918 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `call'
2019-05-08T16:15:18.923 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `block (2 levels) in evaluate'
2019-05-08T16:15:18.925 /usr/share/ruby/vendor_ruby/puppet/util.rb:386:in `block in thinmark'
2019-05-08T16:15:18.928 /usr/share/ruby/benchmark.rb:296:in `realtime'
2019-05-08T16:15:18.931 /usr/share/ruby/vendor_ruby/puppet/util.rb:385:in `thinmark'
2019-05-08T16:15:18.934 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:163:in `block in evaluate'
2019-05-08T16:15:18.937 /usr/share/ruby/vendor_ruby/puppet/graph/relationship_graph.rb:118:in `traverse'
2019-05-08T16:15:18.940 /usr/share/ruby/vendor_ruby/puppet/transaction.rb:154:in `evaluate'
2019-05-08T16:15:18.943 /usr/share/ruby/vendor_ruby/puppet/resource/catalog.rb:222:in `block in apply'
2019-05-08T16:15:18.946 /usr/share/ruby/vendor_ruby/puppet/util/log.rb:155:in `with_destination'
2019-05-08T16:15:18.949 /usr/share/ruby/vendor_ruby/puppet/transaction/report.rb:142:in `as_logging_destination'
2019-05-08T16:15:18.951 /usr/share/ruby/vendor_ruby/puppet/resource/catalog.rb:221:in `apply'
2019-05-08T16:15:18.953 /usr/share/ruby/vendor_ruby/puppet/configurer.rb:171:in `block in apply_catalog'
2019-05-08T16:15:18.956 /usr/share/ruby/vendor_ruby/puppet/util.rb:223:in `block in benchmark'
2019-05-08T16:15:18.958 /usr/share/ruby/benchmark.rb:296:in `realtime'
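
For context, the failing lvcreate asks for 31457280 KiB (30 GiB) for docker-lv, and 31457280 KiB / 960 extents implies a 32 MiB extent size, so the 357 free extents amount to roughly 11 GiB free in cgts-vg. A quick way to confirm the free space on the node is standard LVM tooling (not a StarlingX-specific command):

# Show size, free space, free extent count and extent size for the platform volume group
vgs -o vg_name,vg_size,vg_free,vg_free_count,vg_extent_size cgts-vg
# or, equivalently:
vgdisplay cgts-vg | grep -E 'Free|PE Size'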

Expected Behavior
------------------
The host comes up as available after reinstall

Actual Behavior
----------------
The host came up as failed after reinstall

Reproducibility
---------------
Unsure; tried once with this load. Previous host reinstalls failed on older loads due to bug 1823396.

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
Master 20190506T233000Z

Last Pass
---------
Has not passed before on StarlingX

Timestamp/Logs
--------------
See above

Test Activity
-------------
Feature testing

Maria Yousaf (myousaf) wrote :
Numan Waheed (nwaheed) on 2019-05-08
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Can you please try a re-install without changing ceph parameters?

Changed in starlingx:
status: New → Incomplete
assignee: nobody → Maria Yousaf (myousaf)
Maria Yousaf (myousaf) wrote :

Reinstall was successful on a worker ceph-mon node that did not have any filesystem modifications. This was observed with build 20190508T013000Z (master).

Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue appears related to ceph

summary: - Worker node reporting failed after host re-install
+ Worker node reporting failed after host re-install post ceph-mon fs
+ resizing
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
tags: added: stx.2.0 stx.distro.other
tags: added: stx.config
removed: stx.distro.other
tags: added: stx.storage
Changed in starlingx:
assignee: Maria Yousaf (myousaf) → Cindy Xie (xxie1)
Ghada Khalil (gkhalil) on 2019-05-10
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Ovidiu Poncea (ovidiu.poncea)
Cindy Xie (xxie1) on 2019-05-13
tags: added: stx.distro.other
Cindy Xie (xxie1) on 2019-05-13
tags: removed: stx.storage
Ovidiu Poncea (ovidiu.poncea) wrote :

Error is:

2019-05-08-16-20-25_worker/puppet.log:5756:2019-05-08T16:21:01.179 Error: 2019-05-08 16:21:01 +0000 /Stage[main]/Platform::Filesystem::Docker/Platform::Filesystem[docker-lv]/Logical_volume[docker-lv]/ensure: change from absent to present failed: Execution of '/usr/sbin/lvcreate -n docker-lv --size 31457280k cgts-vg' returned 5: Volume group "cgts-vg" has insufficient free space (357 extents): 960 required.

The reason is that there is not enough space in cgts-vg to perform the resize. The problem is that, on a ceph-mon resize, we don't check the available space in cgts-vg for compute nodes; in this case there was not enough, which is why the reinstall failed. We only perform this check for controllers before doing the resize.
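
A minimal sketch of the kind of pre-check that was missing for worker nodes (shell only, run on the node; the real fix belongs in the sysinv semantic checks rather than a script like this, and the variable names are hypothetical):

# Hypothetical pre-check: does cgts-vg have room for the requested ceph-mon growth?
CURRENT_GIB=20                                   # current ceph_mon_gib
REQUESTED_GIB=40                                 # requested ceph_mon_gib
GROWTH_GIB=$(( REQUESTED_GIB - CURRENT_GIB ))

# Free space in cgts-vg, in GiB (rounded down)
FREE_GIB=$(vgs --noheadings --units g --nosuffix -o vg_free cgts-vg | awk '{printf "%d", $1}')

if [ "$FREE_GIB" -lt "$GROWTH_GIB" ]; then
    echo "Reject resize: cgts-vg has only ${FREE_GIB} GiB free, need ${GROWTH_GIB} GiB"
else
    echo "OK: cgts-vg has ${FREE_GIB} GiB free for ${GROWTH_GIB} GiB of growth"
fi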

Ovidiu Poncea (ovidiu.poncea) wrote :

This issue will be fixed once the changes for https://bugs.launchpad.net/starlingx/+bug/1827119 are implemented, as this semantic check (i.e. checking available space on computes and storages when doing the resize) is implied as part of solution 2A proposed there.

So, I suggest we assign this too to Tingjie Chen.

Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Tingjie Chen (silverhandy)
tags: added: stx.storage
Changed in starlingx:
status: Triaged → In Progress
Cindy Xie (xxie1) on 2019-08-09
Changed in starlingx:
status: In Progress → Fix Released
Wendy Mitchell (wmitchellwr) wrote :

LP-1827119 was reopened. Waiting on fix for that before retesting.

Wendy Mitchell (wmitchellwr) wrote :

BUILD_ID="2019-10-20_20-00-00"

Tested on a 2+2 system (IPv6 config), wcp 71-75:
verified the ceph monitor change from compute-0 to compute-1
verified that the update attempt to 40 GiB is rejected (semantic checks)

$ system ceph-mon-modify controller-0 ceph_mon_gib=40
Node: compute-1 Total target growth size 18 GiB for database (doubled for upgrades), glance, scratch, backup, extension and ceph-mon exceeds growth limit of 1 GiB.

$ system ceph-mon-modify controller-0 ceph_mon_gib=23
+--------------------------------------+--------------+--------------+------------+------+
| uuid                                 | ceph_mon_gib | hostname     | state      | task |
+--------------------------------------+--------------+--------------+------------+------+
| 92f084f4-d6c6-4e36-a89b-650fe571ac58 | 23           | compute-1    | configured | None |
| 9a56f706-3544-4fae-a834-09567fe96411 | 23           | controller-0 | configured | None |
| bc80606d-d4e0-4ecd-8d93-b70c1a12be6e | 23           | controller-1 | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

NOTE: ceph_mon_gib for both controllers is changed as well.

compute-1 reinstalled

The attempt to unlock compute-1 failed to run the puppet manifest:
 3 | compute-1 | worker | unlocked | disabled | failed
[ 144.282145] worker_config[22602]: Failed to run the puppet manifest (RC:1)

Wendy Mitchell (wmitchellwr) wrote :

see puppet.log Errors

compute-1:/var/log/puppet/2019-10-21-16-54-44_worker# cat puppet.log | grep Error
2019-10-21T16:55:21.898 Error: 2019-10-21 16:55:21 +0000 Execution of '/usr/sbin/lvcreate -n docker-lv --size 31457280k cgts-vg' returned 5: Volume group "cgts-vg" has insufficient free space (901 extents): 960 required.
2019-10-21T16:55:22.125 Error: 2019-10-21 16:55:21 +0000 /Stage[main]/Platform::Filesystem::Docker/Platform::Filesystem[docker-lv]/Logical_volume[docker-lv]/ensure: change from absent to present failed: Execution of '/usr/sbin/lvcreate -n docker-lv --size 31457280k cgts-vg' returned 5: Volume group "cgts-vg" has insufficient free space (901 extents): 960 required.
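
For context, the same 32 MiB extent size implied by the message (31457280 KiB / 960 extents) puts this retest only slightly short: 901 free extents is about 28.2 GiB, 59 extents (roughly 1.8 GiB) less than the 30 GiB that docker-lv needs.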

Ghada Khalil (gkhalil) wrote :

Re-opening as the re-test failed as per above

Changed in starlingx:
status: Fix Released → Confirmed
Austin Sun (sunausti) wrote :

TingJie:
    would you give a short update on this issue?

Tingjie Chen (silverhandy) wrote :

The duplicate LP 1827119 has a fix released with patch https://review.opendev.org/#/c/677424/, which merged on Sep 23rd, 2019.
I have tried with 20191120T023000Z to verify the issue, but cannot reproduce it, since setting ceph_mon_gib=23 does not succeed because the available space in compute-0 allows only 1 GiB of growth.

@Wendy, can you re-check the issue with the latest image?

The following is the output from my evaluation:
[sysadmin@controller-0 ~(keystone_admin)]$ system ceph-mon-list
+--------------------------------------+--------------+--------------+------------+------+
| uuid                                 | ceph_mon_gib | hostname     | state      | task |
+--------------------------------------+--------------+--------------+------------+------+
| 2e78c9e8-e6b8-4f79-b498-92de29538c24 | 20           | controller-0 | configured | None |
| 372658d0-0a44-465a-9d8f-12d6dba6b38e | 20           | controller-1 | configured | None |
| 53a8dce1-e830-4514-97f2-652a9ec5ab9f | 20           | compute-0    | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

$ system ceph-mon-modify controller-0 ceph_mon_gib=40
Node: compute-0 Total target growth size 20 GiB for database (doubled for upgrades), glance, scratch, backup, extension and ceph-mon exceeds growth limit of 1 GiB.

$ system ceph-mon-modify controller-0 ceph_mon_gib=23
Node: compute-0 Total target growth size 3 GiB for database (doubled for upgrades), glance, scratch, backup, extension and ceph-mon exceeds growth limit of 1 GiB.

$ system ceph-mon-modify controller-0 ceph_mon_gib=21
+--------------------------------------+--------------+--------------+------------+------+
| uuid                                 | ceph_mon_gib | hostname     | state      | task |
+--------------------------------------+--------------+--------------+------------+------+
| 2e78c9e8-e6b8-4f79-b498-92de29538c24 | 21           | controller-0 | configured | None |
| 372658d0-0a44-465a-9d8f-12d6dba6b38e | 21           | controller-1 | configured | None |
| 53a8dce1-e830-4514-97f2-652a9ec5ab9f | 21           | compute-0    | configured | None |
+--------------------------------------+--------------+--------------+------------+------+

NOTE: ceph_mon_gib for both controllers is changed as well.

System configuration has changed.
please follow the administrator guide to complete configuring system.

# After compute-0 lock and unlock, system host-list shows the hosts normally.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
| 3  | compute-0    | worker      | unlocked       | enabled     | available    |
| 4  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

# s...


Austin Sun (sunausti) wrote :

Hi TingJie:
    it seems the new test log was from 10/21, by which time the fix for 1827119 had already been merged.
@Wendy:
    would you confirm whether your test was based on master or the stx.2.0 release?
