[swarm] OSD node is offline

Bug #1643902 reported by Dmitry Belyaninov
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: MOS Ceph

Bug Description

Detailed bug description:
There is a failed swarm test:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/133/testReport/(root)/check_ceph_partitions_after_reboot/check_ceph_partitions_after_reboot/

AssertionError: OSD node 2 is down

So osd-2 is down on node-3:
ceph osd tree -f json ->
http://paste.openstack.org/show/590073/
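
For reference, a minimal sketch (not part of the original report) of how the down OSD can be spotted in that output; it assumes the usual 'ceph osd tree -f json' layout, i.e. a top-level "nodes" list where each OSD entry carries a "status" field:

import json
import subprocess

def down_osds():
    # Run on any node with an admin keyring; returns e.g. ['osd.2'].
    out = subprocess.check_output(['ceph', 'osd', 'tree', '-f', 'json'])
    tree = json.loads(out.decode('utf-8'))
    return [n['name'] for n in tree['nodes']
            if n.get('type') == 'osd' and n.get('status') != 'up']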

Steps to reproduce:
run the test check_ceph_partitions_after_reboot
Expected results:
pass
Actual result:
fail
Reproducibility:
yes
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

tags: added: area-library
Changed in fuel:
status: New → Confirmed
tags: added: swarm-fail
Alexei Sheplyakov (asheplyakov) wrote :

> run the test check_ceph_partitions_after_reboot

Please explain what the test does, or where the code can be found.

Oleksiy Molchanov (omolchanov) wrote :

Look at https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/141/console

The test reboots the OSD node.

01:19:11 - rebooting / ceph-osd shutdown is complete

01:22:10 - ceph tries to start osd-2 after the reboot and fails with '2016-11-30 01:22:27.310876 7f4ffa57c700 0 -- 10.109.2.3:6801/4027 >> 10.109.2.3:6805/4128 pipe(0x55f15614c000 sd=74 :6801 s=0 pgs=0 cs=0 l=0 c=0x55f155767a20).accept connect_seq 0 vs existing 0 state wait', and nothing happens afterwards.

On the same node, another OSD:
01:22:27 - ceph tries to start osd-3 after the reboot and fails with '2016-11-30 01:22:27.310470 7f1c8e81f700 0 -- 10.109.2.3:6805/4128 >> 10.109.2.3:6801/4027 pipe(0x56236384e000 sd=173 :6805 s=0 pgs=0 cs=0 l=0 c=0x5623631958c0).accept connect_seq 0 vs existing 0 state connecting'
01:26:37 - osd-3 starts successfully with '2016-11-30 01:26:37.392529 7fe0f7daa800 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-osd, pid 2935'

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Ceph (mos-ceph)
Alexei Sheplyakov (asheplyakov) wrote :

> The test reboots the OSD node.

The devil is in the details.

1) Does the test reboot a single node, all OSD nodes, or the whole cluster?
2) How long does the test wait before checking the OSDs' availability?
3) How exactly does the test check whether an OSD is OK?

Please give a sequence of shell commands to simulate the test, or a link to the test source.
Thanks in advance.

Alexei Sheplyakov (asheplyakov) wrote :

> Look at https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/141/console

This is the log of the tool that drives the tests. It doesn't contain any information about what the test was actually doing or what exactly failed.

Changed in fuel:
status: Confirmed → Incomplete
Alexey Kalashnikov (akalashnikov) wrote :

Reproduced on 9.2 snapshot #602

Changed in fuel:
status: Incomplete → Confirmed
Alexei Sheplyakov (asheplyakov) wrote :

> Reproduced on 9.2 snapshot #602

This gives absolutely no useful information.

Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete (as it's not clear how to reproduce the problem). Please give a sequence of shell commands to simulate the test, or a link to the test source, and reopen this bug.

Changed in fuel:
status: Confirmed → Incomplete
assignee: MOS Ceph (mos-ceph) → nobody
Nastya Urlapova (aurlapova) wrote :

Reproduced again:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/154/testReport/(root)/check_ceph_partitions_after_reboot/check_ceph_partitions_after_reboot/
@Aleksey, you can find the env on the server srv112-bud.infra.mirantis.net; to revert it, please use: source /home/jenkins/venv-nailgun-tests-2.9/bin/activate; dos.py revert-resume 9.x.system_test.ubuntu.ceph_ha_one_controller.154 error_check_ceph_partitions_after_reboot && ssh root@10.109.20.2

Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
status: Incomplete → Confirmed
Alexei Sheplyakov (asheplyakov) wrote :

@Nastya,

(venv-nailgun-tests-2.9)asheplyakov@srv112-bud:~$ dos.py revert-resume 9.x.system_test.ubuntu.ceph_ha_one_controller.154 error_check_ceph_partitions_after_reboot

Enviroment with name 9.x.system_test.ubuntu.ceph_ha_one_controller.154 doesn't exist.

(Leaving aside the fact that snapshots of individual VMs do not preserve the cluster state, such as TCP connections, timers, etc.)

Alexei Sheplyakov (asheplyakov) wrote :

OK, I've searched github for 'check_ceph_partitions_after_reboot' and found this:

 https://github.com/openstack/fuel-qa/blob/36f7965045329eab47e089fe67207dc39e92056e/fuelweb_test/tests/test_ceph.py#L903-L916

There are several problems in this code.

1) Device node names can change across reboots: just because the drive's device node was /dev/vdb
   before a reboot doesn't mean it's going to be the same after the reboot.

2) utils.get_ceph_partitions [1] should check that the target node is up (booting the OS until networking is ready takes some time) instead of blindly reporting a "No such partition" error; a sketch of such a retry follows below.

[1] https://github.com/openstack/fuel-qa/blob/d9d8c524ef43409c0d8a2199a1fa8dfddaa500d5/fuelweb_test/helpers/utils.py#L978-L984
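
A minimal sketch of the kind of retry that points 1) and 2) suggest; call_with_retries is a hypothetical helper, and the call shown in the usage comment is illustrative (the exact get_ceph_partitions signature may differ):

import time

def call_with_retries(fn, timeout=300, interval=10):
    # Keep calling fn() until it stops raising or the timeout expires, so a
    # node that is still booting does not immediately turn into a
    # "No such partition" failure.
    deadline = time.time() + timeout
    while True:
        try:
            return fn()
        except Exception:
            if time.time() >= deadline:
                raise
            time.sleep(interval)

# Illustrative usage:
# partitions = call_with_retries(
#     lambda: utils.get_ceph_partitions(node_ip, '/dev/vdb'))

Resolving the drive through a stable identifier (e.g. a /dev/disk/by-path symlink) rather than /dev/vdb could also address point 1).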

Alexei Sheplyakov (asheplyakov) wrote :

3) OSD startup takes some time, so a few down OSDs just after a reboot are perfectly fine.
   helpers.ceph.check_disks [1] should therefore retry instead of immediately bailing out; a sketch of such a retry follows below.

[1] https://github.com/openstack/fuel-qa/blob/36f7965045329eab47e089fe67207dc39e92056e/fuelweb_test/helpers/ceph.py#L117-L135
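
A minimal sketch of such a retry; run_remote() is a placeholder for whatever transport fuel-qa uses to execute commands on a controller node, and the JSON layout is the same 'ceph osd tree -f json' format quoted in the description:

import json
import time

def wait_for_all_osds_up(run_remote, timeout=600, interval=15):
    # Poll 'ceph osd tree -f json' until every OSD reports "up" or the
    # timeout expires, instead of asserting right after the reboot.
    deadline = time.time() + timeout
    while True:
        tree = json.loads(run_remote('ceph osd tree -f json'))
        down = [n['name'] for n in tree['nodes']
                if n.get('type') == 'osd' and n.get('status') != 'up']
        if not down:
            return
        if time.time() >= deadline:
            raise AssertionError(
                'OSDs still down after {0}s: {1}'.format(timeout, down))
        time.sleep(interval)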

Changed in fuel:
assignee: MOS Ceph (mos-ceph) → nobody
Alexei Sheplyakov (asheplyakov) wrote :

The error is in the fuel-qa code (see comments #10 and #11) and has nothing to do with Ceph itself.

Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
assignee: Fuel QA Team (fuel-qa) → MOS Ceph (mos-ceph)
Oleksiy Molchanov (omolchanov) wrote :

Alexei, please check my comment #2.

osd-2 will never start: there is no process for it in ps; it simply died.
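
A hedged sketch of that kind of check; remote_exec(cmd) -> (exit_code, stdout) is a placeholder transport, and the '-i <id>' pattern is an assumption about how the ceph-osd daemon appears in the process list:

def osd_process_running(remote_exec, osd_id):
    # True if a ceph-osd process for the given OSD id shows up on the node.
    code, _ = remote_exec("pgrep -af 'ceph-osd.* -i {0}( |$)'".format(osd_id))
    return code == 0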

Dmitry Belyaninov (dbelyaninov) wrote :
Kostiantyn Danylov (kdanylov) wrote :

QA team: it's not clear from the logs what is happening. Please take into account that it is VERY unlikely that an OSD would not start after a node reboot without an external reason. Please check that the node is in the expected state after the reboot. If you are sure that this is a Ceph issue, please provide a clean list of steps and a test environment.

Changed in fuel:
status: Confirmed → Won't Fix
Kostiantyn Danylov (kdanylov) wrote :

Making it 'Won't Fix' in 9.2 for now.
