Ceph cluster may misbehave after nodes are reset

Bug #1568732 reported by Vadim Rovachev
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Vadim Rovachev
Milestone: 8.0-updates

Bug Description

Detailed bug description:
Swarm test:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L88
often fails on version 8.0:
https://patching-ci.infra.mirantis.net/view/8.0.swarm/job/8.0.system_test.ubuntu.thread_3/
After deleting 2 of the 6 Ceph nodes:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L127-L141
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L143-L154
and then cold-restarting all online nodes:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L160-L165

the Ceph cluster may misbehave.

When trying to create a volume, the Cinder logs contain the following:
https://paste.mirantis.net/show/2100/

After a rather long time (~1 hour) the Ceph cluster returns to normal operation.

Steps to reproduce:
1. Create an environment with 6 Ceph nodes
Example:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_ceph.py#L271-L276
2. Delete 2 Ceph nodes from the cluster, reconfiguring the OSDs:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L2443-L2455
3. Destroy the 2 Ceph nodes deleted in step 2
4. Destroy all other nodes
5. Bring the nodes destroyed in step 4 back online
6. Run the OSTF test "Create volume and attach it to instance" (a health-wait sketch that could precede this step is given below)
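
The failure pattern above suggests OSTF is run while Ceph is still recovering from the reboot. A minimal sketch of a health wait that could precede step 6, assuming the ceph CLI is available where the check runs (the helper name, timeout, and JSON field fallback are illustrative, not part of fuel-qa):

    import json
    import subprocess
    import time


    def wait_for_ceph_health(timeout=3600, interval=30):
        """Poll `ceph health` until the cluster reports HEALTH_OK or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            raw = subprocess.check_output(['ceph', 'health', '--format', 'json'])
            health = json.loads(raw.decode())
            # Hammer-era releases expose 'overall_status'; newer ones use 'status'.
            if health.get('overall_status', health.get('status')) == 'HEALTH_OK':
                return
            time.sleep(interval)
        raise RuntimeError('Ceph did not reach HEALTH_OK within %s seconds' % timeout)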

Expected results:
Test passes
Actual result:
Test fails

Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
milestone: none → 8.0-updates
importance: Undecided → High
description: updated
Revision history for this message
Vadim Rovachev (vrovachev) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> After deleting 2 of the 6 Ceph nodes
> the Ceph cluster may misbehave.

What does that mean, exactly? Please provide the output of ceph -s, ceph osd dump, ceph mon stat,
and the exact command which misbehaves. Hint: "rbd snap create -p volumes volume-$uuid@mysnap fails
with foo-bar error message" is a good description, and "$FOO test BAR fails" is next to useless.
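
As an illustration of that kind of report, a minimal sketch that runs a single suspect command and records its exit code and output verbatim (the pool name and volume UUID are placeholders, not values from the failed run):

    import subprocess

    # Placeholders; substitute the real pool name and volume UUID from the failed run.
    POOL = 'volumes'
    VOLUME = 'volume-00000000-0000-0000-0000-000000000000'

    cmd = ['rbd', 'snap', 'create', '-p', POOL, '%s@mysnap' % VOLUME]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print('$ ' + ' '.join(cmd))
    print('exit code: %d' % proc.returncode)
    print(proc.stdout + proc.stderr)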

> 4. Destroy all other nodes
> 5. Bring the nodes destroyed in step 4 back online

Could you please explain what `destroy the node' actually means? Is it a hard reboot or something else?

> https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L2442-L2455

One should remove OSDs one by one giving ceph enough time to complete the data migration,
that is, one should proceed to removing the next OSD only after the data has been successfully migrated
after removing the previous one. See the official ceph documentation for more details (http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster)
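
A minimal sketch of that procedure, following the hammer documentation linked above, assuming the ceph CLI is run with admin credentials (the OSD ids and wait helper are illustrative; stopping each ceph-osd daemon on its host is assumed to happen out of band):

    import subprocess
    import time

    OSDS_TO_REMOVE = [4, 5]  # illustrative OSD ids, not taken from the failed run


    def ceph(*args):
        """Run a ceph CLI command and return its output."""
        return subprocess.check_output(('ceph',) + args).decode()


    def wait_until_clean(interval=30):
        """Block until the cluster reports HEALTH_OK, i.e. recovery has finished."""
        while not ceph('health').startswith('HEALTH_OK'):
            time.sleep(interval)


    for osd_id in OSDS_TO_REMOVE:
        # Mark the OSD out and let Ceph migrate its data before touching the next one.
        ceph('osd', 'out', str(osd_id))
        wait_until_clean()
        # Once the data is safe (and the ceph-osd daemon on the host is stopped),
        # remove the OSD from the CRUSH map, delete its keys, and drop it from the OSD map.
        ceph('osd', 'crush', 'remove', 'osd.%d' % osd_id)
        ceph('auth', 'del', 'osd.%d' % osd_id)
        ceph('osd', 'rm', str(osd_id))
        wait_until_clean()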

> 1. Create an environment with 6 Ceph nodes
> 2. Delete 2 Ceph nodes from the cluster
> 3. Destroy the 2 Ceph nodes deleted in step 2
> 4. Destroy all other nodes

So 1/3 of the storage has been improperly removed, and immediately after that
the whole cluster has been hard rebooted.

As a side note,

> Example: https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_ceph.py#L271-L276

This cluster has both OSDs and monitors deployed at `slave-01', `slave-02', and `slave-03'.
Deploying OSDs and monitors onto the same node is not recommended. Please reproduce the problem
with a supported configuration (no mons and OSDs on the same host), collect the relevant data,
including, but not limited to, the output of ceph -s, ceph osd dump, ceph mon stat, and the exact command
which misbehaves, and reopen this bug.
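
For comparison, a sketch of a layout matching that recommendation, written in the style of the fuel-qa tests and assuming the update_nodes() helper from fuel_web_client; in Fuel the Ceph monitors run on controllers, so the controller and ceph-osd roles go to different nodes (node names and counts are illustrative):

    def assign_roles(fuel_web, cluster_id):
        """Keep Ceph monitors (on controllers) and OSDs on separate hosts."""
        fuel_web.update_nodes(cluster_id, {
            'slave-01': ['controller'],
            'slave-02': ['controller'],
            'slave-03': ['controller'],
            'slave-04': ['ceph-osd'],
            'slave-05': ['ceph-osd'],
            'slave-06': ['ceph-osd'],
            'slave-07': ['ceph-osd'],
            'slave-08': ['ceph-osd'],
            'slave-09': ['ceph-osd'],
        })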

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The problem description is way too vague, marking the bug as Incomplete. Feel free to provide the relevant data (see the comment #2) and reopen the bug.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> After deleting 2 of the 6 Ceph nodes the Ceph cluster may misbehave.
> After a rather long time (~1 hour) the Ceph cluster returns to normal operation.

Also note that depending on the amount of data and the hardware, spending 1 hour to rescue the data after losing 1/3 of the storage might be acceptable behavior (that's yet another reason why this bug has been marked as Incomplete).

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Hello, Alexei

> Could you please explain what `destroy the node' actually means? Is it a hard reboot or something else?

The nodes really are destroyed:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L1737

After resetting the environment from snapshots twice, the bug did not reproduce.
Output of the commands:
https://paste.mirantis.net/show/2101/

Alexei, if you think that this test is wrong, and that the method for deleting one of the Ceph nodes:
https://github.com/openstack/fuel-qa/blob/master/fuelweb_test/models/fuel_web_client.py#L2503-L2528
is wrong, please give us new instructions for deleting Ceph nodes and reassign this bug to the fuel-qa team.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Nodes really destroyed:

I really doubt it (you don't really nuke those poor servers, do you?).

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L1737

I guess node.destroy() expands to virsh destroy, which is similar to a hard reboot.

> Output of commands: https://paste.mirantis.net/show/2101

It looks a bit fishy. In particular, line 6 reads:

  osdmap e110: 5 osds: 5 up, 5 in

However, there should have been 4 OSDs, not 5. Also

osd.8 up

(line 28) looks strange: the maximal OSD id should have been 5.
Which brings us to the initial point: the problem description is still way too vague.
In order to make this bug report useful one should (a small collection sketch follows the list):

1) Specify the initial state of Ceph cluster (ceph -s, ceph osd dump, ceph mon stat).
2) Describe how those 2 OSDs have been removed (the exact sequence of commands)
3) Describe the state of the cluster after removal has been completed (ceph -s, ceph osd dump, ceph mon stat).
4) Explain the subsequent actions (presumably the whole cluster has been hard rebooted, is this correct?)
5) Describe the state of the cluster after the hard reboot (or whatever that was)
6) Specify which command(s) have been run after the hard reboot (is it rbd create + rbd map),
    and describe their expected and actual results/outputs
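
A minimal sketch of collecting that data at each stage, assuming the ceph CLI can be run from wherever the scenario is driven (file naming and stage labels are illustrative):

    import subprocess

    COMMANDS = (
        ['ceph', '-s'],
        ['ceph', 'osd', 'dump'],
        ['ceph', 'mon', 'stat'],
    )


    def snapshot_cluster_state(stage):
        """Write ceph -s / osd dump / mon stat into a file named after the scenario stage."""
        with open('ceph-state-%s.txt' % stage, 'w') as report:
            for cmd in COMMANDS:
                report.write('$ %s\n' % ' '.join(cmd))
                report.write(subprocess.check_output(cmd).decode())
                report.write('\n')


    # Call at the points listed above, e.g.:
    # snapshot_cluster_state('initial')            # before removing the OSDs (item 1)
    # snapshot_cluster_state('after-osd-removal')  # item 3
    # snapshot_cluster_state('after-hard-reboot')  # item 5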

Changed in fuel:
status: New → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Vadim, please see Alexey's comment.

Changed in fuel:
assignee: MOS Ceph (mos-ceph) → Vadim Rovachev (vrovachev)
Changed in fuel:
status: Incomplete → Invalid