Ceph cluster may misbehave after nodes are reset

Bug #1568732 reported by Vadim Rovachev
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Vadim Rovachev
Milestone: 8.0-updates

Bug Description

Detailed bug description:
Swarm test:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L88
often fails on version 8.0:
https://patching-ci.infra.mirantis.net/view/8.0.swarm/job/8.0.system_test.ubuntu.thread_3/
After deleting 2 of the 6 Ceph nodes:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L127-L141
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L143-L154
and then cold-restarting all online nodes:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/tests_strength/test_restart.py#L160-L165

the Ceph cluster may misbehave.

When trying to create a volume, the Cinder logs contain the following:
https://paste.mirantis.net/show/2100/

After a rather long time (~1 hour) the Ceph cluster returns to normal operation.

Steps to reproduce:
1. Create an environment with 6 Ceph nodes
Example:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_ceph.py#L271-L276
2. Delete 2 Ceph nodes from the cluster, reconfiguring the OSDs:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L2443-L2455
3. Destroy the 2 Ceph nodes deleted in step 2
4. Destroy all other nodes
5. Bring the nodes destroyed in step 4 back online
6. Run the OSTF test "Create volume and attach it to instance" (a health-wait sketch that could precede this step is given below)
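
The failure pattern above suggests OSTF is run while Ceph is still recovering from the reboot. A minimal sketch of a health wait that could precede step 6, assuming the ceph CLI is available where the check runs (the helper name, timeout, and JSON field fallback are illustrative, not part of fuel-qa):

    import json
    import subprocess
    import time


    def wait_for_ceph_health(timeout=3600, interval=30):
        """Poll `ceph health` until the cluster reports HEALTH_OK or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            raw = subprocess.check_output(['ceph', 'health', '--format', 'json'])
            health = json.loads(raw.decode())
            # Hammer-era releases expose 'overall_status'; newer ones use 'status'.
            if health.get('overall_status', health.get('status')) == 'HEALTH_OK':
                return
            time.sleep(interval)
        raise RuntimeError('Ceph did not reach HEALTH_OK within %s seconds' % timeout)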

Expected results:
Test passes
Actual result:
Test fails

Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
milestone: none → 8.0-updates
importance: Undecided → High
description: updated
Revision history for this message
Vadim Rovachev (vrovachev) wrote :
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> After deleting 2 of the 6 Ceph nodes
> the Ceph cluster may misbehave.

What does that mean, exactly? Please provide the output of ceph -s, ceph osd dump, ceph mon stat,
and the exact command which misbehaves. Hint: "rbd snap create -p volumes volume-$uuid@mysnap fails
with foo-bar error message" is a good description, and "$FOO test BAR fails" is next to useless.
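
As an illustration of that kind of report, a minimal sketch that runs a single suspect command and records its exit code and output verbatim (the pool name and volume UUID are placeholders, not values from the failed run):

    import subprocess

    # Placeholders; substitute the real pool name and volume UUID from the failed run.
    POOL = 'volumes'
    VOLUME = 'volume-00000000-0000-0000-0000-000000000000'

    cmd = ['rbd', 'snap', 'create', '-p', POOL, '%s@mysnap' % VOLUME]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print('$ ' + ' '.join(cmd))
    print('exit code: %d' % proc.returncode)
    print(proc.stdout + proc.stderr)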

> 4. Destroy all other nodes
> 5. Bring the nodes destroyed in step 4 back online

Could you please explain what `destroy the node' actually means? Is it a hard reboot or something else?

> https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L2442-L2455

One should remove OSDs one by one giving ceph enough time to complete the data migration,
that is, one should proceed to removing the next OSD only after the data has been successfully migrated
after removing the previous one. See the official ceph documentation for more details (http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster)
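
A minimal sketch of that procedure, following the hammer documentation linked above, assuming the ceph CLI is run with admin credentials (the OSD ids and wait helper are illustrative; stopping each ceph-osd daemon on its host is assumed to happen out of band):

    import subprocess
    import time

    OSDS_TO_REMOVE = [4, 5]  # illustrative OSD ids, not taken from the failed run


    def ceph(*args):
        """Run a ceph CLI command and return its output."""
        return subprocess.check_output(('ceph',) + args).decode()


    def wait_until_clean(interval=30):
        """Block until the cluster reports HEALTH_OK, i.e. recovery has finished."""
        while not ceph('health').startswith('HEALTH_OK'):
            time.sleep(interval)


    for osd_id in OSDS_TO_REMOVE:
        # Mark the OSD out and let Ceph migrate its data before touching the next one.
        ceph('osd', 'out', str(osd_id))
        wait_until_clean()
        # Once the data is safe (and the ceph-osd daemon on the host is stopped),
        # remove the OSD from the CRUSH map, delete its keys, and drop it from the OSD map.
        ceph('osd', 'crush', 'remove', 'osd.%d' % osd_id)
        ceph('auth', 'del', 'osd.%d' % osd_id)
        ceph('osd', 'rm', str(osd_id))
        wait_until_clean()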

> 1. Create an environment with 6 Ceph nodes
> 2. Delete 2 Ceph nodes from the cluster
> 3. Destroy the 2 Ceph nodes deleted in step 2
> 4. Destroy all other nodes

So 1/3 of the storage has been improperly removed, and immediately after that
the whole cluster has been hard rebooted.

As a side note,

> Example: https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/tests/test_ceph.py#L271-L276

This cluster has both OSDs and monitors deployed at `slave-01', `slave-02', and `slave-03'.
Deploying OSDs and monitors onto the same node is not recommended. Please reproduce the problem
with a supported configuration (no mons and OSDs on the same host), collect the relevant data,
including, but not limited to, the output of ceph -s, ceph osd dump, ceph mon stat, and the exact command
which misbehaves, and reopen this bug.
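
For comparison, a sketch of a layout matching that recommendation, written in the style of the fuel-qa tests and assuming the update_nodes() helper from fuel_web_client; in Fuel the Ceph monitors run on controllers, so the controller and ceph-osd roles go to different nodes (node names and counts are illustrative):

    def assign_roles(fuel_web, cluster_id):
        """Keep Ceph monitors (on controllers) and OSDs on separate hosts."""
        fuel_web.update_nodes(cluster_id, {
            'slave-01': ['controller'],
            'slave-02': ['controller'],
            'slave-03': ['controller'],
            'slave-04': ['ceph-osd'],
            'slave-05': ['ceph-osd'],
            'slave-06': ['ceph-osd'],
            'slave-07': ['ceph-osd'],
            'slave-08': ['ceph-osd'],
            'slave-09': ['ceph-osd'],
        })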

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

The problem description is way too vague, marking the bug as Incomplete. Feel free to provide the relevant data (see the comment #2) and reopen the bug.

Changed in fuel:
status: New → Incomplete
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> After deleting 2 of the 6 Ceph nodes the Ceph cluster may misbehave.
> After a rather long time (~1 hour) the Ceph cluster returns to normal operation.

Also note that depending on the amount of data and the hardware, spending 1 hour to rescue the data after losing 1/3 of the storage might be acceptable behavior (that's yet another reason why this bug has been marked as Incomplete).

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Hello, Alexei

> Could you please explain what `destroy the node' actually means? Is it a hard reboot or something else?

The nodes really are destroyed:
https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L1737

After resetting the environment from snapshots twice, the bug did not reproduce.
Output of the commands:
https://paste.mirantis.net/show/2101/

Alexei, if you think that this test is wrong, and that the method for deleting one of the Ceph nodes:
https://github.com/openstack/fuel-qa/blob/master/fuelweb_test/models/fuel_web_client.py#L2503-L2528
is wrong, please give us new instructions for deleting Ceph nodes and reassign this bug to the fuel-qa team.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Nodes really destroyed:

I really doubt it (you don't really nuke those poor servers, do you?).

https://github.com/openstack/fuel-qa/blob/stable/8.0/fuelweb_test/models/fuel_web_client.py#L1737

I guess node.destroy() expands to virsh destroy, which is similar to a hard reboot.

> Output of commands: https://paste.mirantis.net/show/2101

It looks a bit fishy. In particular, line 6 reads:

  osdmap e110: 5 osds: 5 up, 5 in

However, there should have been 4 OSDs, not 5. Also

osd.8 up

(line 28) looks strange: the maximal OSD id should have been 5.
Which brings us to the initial point: the problem description is still way too vague.
In order to make this bug report useful one should (a small collection sketch follows the list):

1) Specify the initial state of Ceph cluster (ceph -s, ceph osd dump, ceph mon stat).
2) Describe how those 2 OSDs have been removed (the exact sequence of commands)
3) Describe the state of the cluster after removal has been completed (ceph -s, ceph osd dump, ceph mon stat).
4) Explain the subsequent actions (presumably the whole cluster has been hard rebooted, is this correct?)
5) Describe the state of the cluster after the hard reboot (or whatever that was)
6) Specify which command(s) have been run after the hard reboot (is it rbd create + rbd map),
    and describe their expected and actual results/outputs
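
A minimal sketch of collecting that data at each stage, assuming the ceph CLI can be run from wherever the scenario is driven (file naming and stage labels are illustrative):

    import subprocess

    COMMANDS = (
        ['ceph', '-s'],
        ['ceph', 'osd', 'dump'],
        ['ceph', 'mon', 'stat'],
    )


    def snapshot_cluster_state(stage):
        """Write ceph -s / osd dump / mon stat into a file named after the scenario stage."""
        with open('ceph-state-%s.txt' % stage, 'w') as report:
            for cmd in COMMANDS:
                report.write('$ %s\n' % ' '.join(cmd))
                report.write(subprocess.check_output(cmd).decode())
                report.write('\n')


    # Call at the points listed above, e.g.:
    # snapshot_cluster_state('initial')            # before removing the OSDs (item 1)
    # snapshot_cluster_state('after-osd-removal')  # item 3
    # snapshot_cluster_state('after-hard-reboot')  # item 5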

Changed in fuel:
status: New → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Vadim, please see Alexey's comment.

Changed in fuel:
assignee: MOS Ceph (mos-ceph) → Vadim Rovachev (vrovachev)
Changed in fuel:
status: Incomplete → Invalid