Ceph cluster may malfunction after nodes are reset
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Invalid | High | Vadim Rovachev | 8.0-updates
Bug Description
Detailed bug description:
Swarm test:
https:/
On version 8.0 this test often fails:
https:/
because after deleting 2 of the 6 ceph nodes:
https:/
https:/
and then cold-restarting all nodes that remained online:
https:/
ceph may no longer work correctly.
When we then try to create a volume, the cinder logs contain the following:
https:/
After some time (~1 hour) the ceph cluster returns to normal operation.
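A minimal sketch for watching that recovery from any node with the ceph admin keyring (not part of the original report):

```
ceph -w                  # stream cluster events while it rebalances
watch -n 30 'ceph -s'    # or poll status until all PGs are active+clean
```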
Steps to reproduce:
1. Create an env with 6 ceph nodes
Example:
https:/
2. Delete 2 ceph nodes from the cluster, reconfiguring the OSDs:
https:/
3. Destroy the 2 ceph nodes deleted in step 2
4. Destroy all other nodes
5. Bring the nodes destroyed in step 4 back online
6. Run the OSTF test "Create volume and attach it to instance" (approximated in the sketch below)
Expected results:
Test passes
Actual result:
Test fails
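For reference, a rough approximation of what that OSTF check exercises, using the OpenStack CLIs; the volume/instance names, flavor, and image are illustrative assumptions, and flag names vary across client versions:

```
cinder create --display-name ostf-test-volume 1           # create a 1 GB test volume
nova boot --flavor m1.micro --image TestVM ostf-test-vm   # flavor/image names are assumptions
nova volume-attach ostf-test-vm <volume-id> /dev/vdb      # attach; this is what fails while ceph is degraded
```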
Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
milestone: none → 8.0-updates
importance: Undecided → High
description: updated
Changed in fuel:
status: New → Incomplete
Changed in fuel:
status: Incomplete → Invalid
> after deleting 2 of the 6 ceph nodes
> ceph may no longer work correctly
What does that mean, exactly? Please provide the output of ceph -s, ceph osd dump, ceph mon stat,
and the exact command which misbehaves. Hint: "rbd snap create -p volumes volume-$uuid@mysnap fails
with foo-bar error message" is a good description, and "$FOO test BAR fails" is next to useless.
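A minimal sketch of how to collect the requested output, assuming a node with the ceph admin keyring; the rbd invocation is the illustrative example from the hint above, with $uuid standing in for a real volume id:

```
ceph -s          # overall health, monitor quorum, PG states
ceph osd dump    # OSD map: which OSDs are up/in, pool settings
ceph mon stat    # monitor quorum membership
# plus the exact command that misbehaves, e.g.:
rbd snap create -p volumes volume-$uuid@mysnap
```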
> 4. Destroy all other nodes
> 5. Bring the nodes destroyed in step 4 back online
Could you please explain what `destroy the node' actually means? Is it a hard reboot or something else?
> https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L2442-L2455
One should remove OSDs one by one, giving ceph enough time to complete the data migration;
that is, one should proceed to removing the next OSD only after the data has been successfully
migrated following the removal of the previous one. See the official ceph documentation for more
details (http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster).
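A minimal sketch of that procedure, assuming Ubuntu with upstart (as deployed by Fuel 8.0) and placeholder OSD ids 4 and 5; the HEALTH_OK polling is a coarse stand-in for verifying that all PGs are active+clean again:

```
# Remove OSDs one at a time, waiting for data migration before touching the next one.
for id in 4 5; do                       # placeholder OSD ids on the nodes being removed
    ceph osd out "$id"                  # triggers data migration off this OSD
    until ceph health | grep -q HEALTH_OK; do
        sleep 30                        # wait for rebalancing to finish
    done
    stop ceph-osd id="$id"              # upstart syntax; adjust for your init system
    ceph osd crush remove "osd.$id"     # remove it from the CRUSH map
    ceph auth del "osd.$id"             # delete its cephx key
    ceph osd rm "$id"                   # drop the OSD from the OSD map
done
```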
> 1. Create an env with 6 ceph nodes
> 2. Delete 2 ceph nodes from the cluster
> 3. Destroy the 2 ceph nodes deleted in step 2
> 4. Destroy all other nodes
So 1/3 of the storage has been improperly removed, and immediately after that
the whole cluster has been hard rebooted.
As a side note,
> Example: https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_ceph.py#L271-L276
This cluster has both OSDs and monitors deployed at `slave-01', `slave-02', and `slave-03'.
Deploying OSDs and monitors onto the same node is not recommended. Please reproduce the problem
with a supported configuration (no mons and OSDs on the same host), collect the relevant data,
including, but not limited to, the output of ceph -s, ceph osd dump, ceph mon stat, and the exact command
which misbehaves, and reopen this bug.