--force on erasure code profiles causes weird behavior and makes the ceph cluster fail to upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph Monitor Charm | Fix Released | High | James Page |
Charm Helpers | Fix Released | High | James Page |
OpenStack Ceph-Proxy Charm | Fix Released | High | James Page |
Bug Description
Hi OpenStack team,
This troubleshooting was done between billy-olsen, hillpd and me.
I am using the following bundle: https:/
In the upgrade, we move to the equivalent 20.08 charms for ceph-mon, nova-compute, cinder-ceph (which is 20.08/stable branch + EC pool commit).
Then we run the upgrade to Train.
That fails: ceph-mon is not able to form a cluster again, and the ceph-osds never finish coming back.
Looking closer at the ceph-osds, one can see that they fail with a core dump.
Here is a snippet of the core-dump: https:/
The line that drew the team's attention was:
Sep 26 00:06:19 node02 ceph-osd[1874627]: /build/
which should not fail with stripe_size==0.
On that ceph-osd log, we could see:
https:/
Now, pg18 was the EC pool pg, and it was missing 2 of the 11 OSDs it needs (K+M=11, but I had only 9 OSDs). I changed that value in the charm configs post-deployment, then removed and re-added the relation with ceph-mon.
I believe that was the moment when --force was applied, following this code line:
https:/
That caused this issue later, at upgrade.
Full logs of ceph-osd unit: https:/
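The K+M mismatch described above (K+M=11 with only 9 OSDs) can be caught before a profile is created or changed. The following is only a minimal sketch: `missing_osds` is a hypothetical helper, not part of any charm, and it assumes each chunk must land on a distinct OSD. The report gives only K+M=11, so the 8+3 split below is purely illustrative.

```python
def missing_osds(k: int, m: int, osd_count: int) -> int:
    """Return how many OSDs are lacking for a k+m erasure code profile.

    Hypothetical helper; assumes one chunk per OSD (failure domain 'osd').
    """
    return max(0, (k + m) - osd_count)


# K+M=11 on 9 OSDs, as in this report (the 8+3 split is an assumption).
shortfall = missing_osds(8, 3, 9)
if shortfall:
    print(f"profile needs {shortfall} more OSD(s) than are available")
```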
-------
My proposal:
We make --force a boolean option in
https:/
Then, we pass force as a config parameter to ceph-mon, e.g.
allow-force-
In any case, I believe the best practice here is for the force option to default to false, to avoid any weird errors later when using EC pools.
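A rough sketch of what this proposal could look like, assuming the helper builds the `ceph osd erasure-code-profile set` command line as a list. `build_profile_cmd` and its parameters are hypothetical names used for illustration, not the actual charm-helpers function; only the `--force` flag and the `ceph` subcommand are taken from the real CLI.

```python
from typing import Dict, List


def build_profile_cmd(name: str, settings: Dict[str, str],
                      force: bool = False) -> List[str]:
    """Build the CLI call that sets an erasure code profile.

    Hypothetical helper: --force is appended only when the caller
    explicitly asks for it, so the default never overrides a profile.
    """
    cmd = ["ceph", "osd", "erasure-code-profile", "set", name]
    # Profile settings are passed to the CLI as key=value pairs.
    cmd += [f"{key}={value}" for key, value in sorted(settings.items())]
    if force:
        cmd.append("--force")
    return cmd


# Default behaviour: no --force in the resulting command.
print(build_profile_cmd("my-ec-profile", {"k": "4", "m": "2"}))
```

With this shape, the charm config option would only need to be passed through as the `force` argument.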
Changed in charm-helpers:
status: In Progress → Fix Released
Changed in charm-ceph-proxy:
milestone: none → 20.10
Changed in charm-ceph-mon:
milestone: none → 20.10
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Changed in charm-ceph-proxy:
status: Fix Committed → Fix Released
I'm concerned that '--force' is something we should never do. I can't find good documentation on whether changing a profile that is in use is even a supported operation, or whether we are always going to hit issues like the ones described here.
Maybe we should just make the function in the ceph helper treat erasure profiles as idempotent, i.e. if a profile already exists, don't try to change it.
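The idempotent approach suggested here could be sketched roughly as follows. This is an illustration under assumptions, not the charm-helpers implementation: `ensure_profile`, `list_profiles` and `run` are hypothetical names, and the two callables are injected so the logic can be exercised without a cluster. In a real helper, `list_profiles` might wrap `ceph osd erasure-code-profile ls`.

```python
from typing import Callable, Dict, List


def ensure_profile(name: str, settings: Dict[str, str],
                   list_profiles: Callable[[], List[str]],
                   run: Callable[[List[str]], None]) -> bool:
    """Create the profile only if it does not exist yet.

    Returns True when a create command was issued, False when the
    existing profile was left untouched. --force is never used.
    """
    if name in list_profiles():
        return False  # profile exists: do not try to change it
    cmd = ["ceph", "osd", "erasure-code-profile", "set", name]
    cmd += [f"{key}={value}" for key, value in sorted(settings.items())]
    run(cmd)
    return True


# Stand-ins for the real cluster calls (assumptions for this sketch).
issued: List[List[str]] = []
created = ensure_profile("my-ec-profile", {"k": "4", "m": "2"},
                         lambda: ["default"], issued.append)
print(created, issued)
```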