[swarm] OSD node is offline

Bug #1643902 reported by Dmitry Belyaninov
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: MOS Ceph

Bug Description

Detailed bug description:
There is a failed swarm test:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/133/testReport/(root)/check_ceph_partitions_after_reboot/check_ceph_partitions_after_reboot/

AssertionError: OSD node 2 is down

So osd-2 is down on node-3:
ceph osd tree -f json ->
http://paste.openstack.org/show/590073/
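
For reference, a minimal sketch (not part of the original report) of how the down OSD can be spotted in that output; it assumes the usual 'ceph osd tree -f json' layout, i.e. a top-level "nodes" list where each OSD entry carries a "status" field:

import json
import subprocess

def down_osds():
    # Run on any node with an admin keyring; returns e.g. ['osd.2'].
    out = subprocess.check_output(['ceph', 'osd', 'tree', '-f', 'json'])
    tree = json.loads(out.decode('utf-8'))
    return [n['name'] for n in tree['nodes']
            if n.get('type') == 'osd' and n.get('status') != 'up']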

Steps to reproduce:
run the test check_ceph_partitions_after_reboot
Expected results:
pass
Actual result:
fail
Reproducibility:
yes
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

tags: added: area-library
Changed in fuel:
status: New → Confirmed
tags: added: swarm-fail
Alexei Sheplyakov (asheplyakov) wrote :

> run the test check_ceph_partitions_after_reboot

Please explain what the test does, or where the code can be found.

Oleksiy Molchanov (omolchanov) wrote :

Look at https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/141/console

The test reboots the OSD node.

01:19:11 - rebooting / ceph-osd shutdown is complete

01:22:10 - ceph tries to start osd-2 after the reboot and fails with '2016-11-30 01:22:27.310876 7f4ffa57c700 0 -- 10.109.2.3:6801/4027 >> 10.109.2.3:6805/4128 pipe(0x55f15614c000 sd=74 :6801 s=0 pgs=0 cs=0 l=0 c=0x55f155767a20).accept connect_seq 0 vs existing 0 state wait', and nothing happens afterwards.

On the same node, another OSD:
01:22:27 - ceph tries to start osd-3 after the reboot and fails with '2016-11-30 01:22:27.310470 7f1c8e81f700 0 -- 10.109.2.3:6805/4128 >> 10.109.2.3:6801/4027 pipe(0x56236384e000 sd=173 :6805 s=0 pgs=0 cs=0 l=0 c=0x5623631958c0).accept connect_seq 0 vs existing 0 state connecting'
01:26:37 - osd-3 starts successfully with '2016-11-30 01:26:37.392529 7fe0f7daa800 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-osd, pid 2935'

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → MOS Ceph (mos-ceph)
Alexei Sheplyakov (asheplyakov) wrote :

> The test reboots the OSD node.

The devil is in the details.

1) Does the test reboot a single node, all OSD nodes, or the whole cluster?
2) How long does the test wait before checking the OSDs' availability?
3) How exactly does the test check whether an OSD is OK?

Please give a sequence of shell commands to simulate the test, or a link to the test source.
Thanks in advance.

Alexei Sheplyakov (asheplyakov) wrote :

> Look at https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/141/console

This is the log of the tool that drives the tests. It doesn't contain any information about what the test was actually doing or what exactly failed.

Changed in fuel:
status: Confirmed → Incomplete
Alexey Kalashnikov (akalashnikov) wrote :

Reproduced on 9.2 snapshot #602

Changed in fuel:
status: Incomplete → Confirmed
Alexei Sheplyakov (asheplyakov) wrote :

> Reproduced on 9.2 snapshot #602

This gives absolutely no useful information.

Alexei Sheplyakov (asheplyakov) wrote :

Marking as Incomplete (as it's not clear how to reproduce the problem). Please give a sequence of shell commands to simulate the test, or a link to the test source, and reopen this bug.

Changed in fuel:
status: Confirmed → Incomplete
assignee: MOS Ceph (mos-ceph) → nobody
Nastya Urlapova (aurlapova) wrote :

Reproduced again:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ceph_ha_one_controller/154/testReport/(root)/check_ceph_partitions_after_reboot/check_ceph_partitions_after_reboot/
@Aleksey, you can find the env on the server srv112-bud.infra.mirantis.net; to revert it, please use: source /home/jenkins/venv-nailgun-tests-2.9/bin/activate; dos.py revert-resume 9.x.system_test.ubuntu.ceph_ha_one_controller.154 error_check_ceph_partitions_after_reboot && ssh root@10.109.20.2

Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
status: Incomplete → Confirmed
Alexei Sheplyakov (asheplyakov) wrote :

@Nastya,

(venv-nailgun-tests-2.9)asheplyakov@srv112-bud:~$ dos.py revert-resume 9.x.system_test.ubuntu.ceph_ha_one_controller.154 error_check_ceph_partitions_after_reboot

Enviroment with name 9.x.system_test.ubuntu.ceph_ha_one_controller.154 doesn't exist.

(Leaving aside the fact that snapshots of individual VMs do not preserve the cluster state, such as TCP connections, timers, etc.)

Alexei Sheplyakov (asheplyakov) wrote :

OK, I've searched github for 'check_ceph_partitions_after_reboot' and found this:

 https://github.com/openstack/fuel-qa/blob/36f7965045329eab47e089fe67207dc39e92056e/fuelweb_test/tests/test_ceph.py#L903-L916

There are several problems in this code.

1) Device node names can change across reboots: just because the drive's device node was /dev/vdb
   before a reboot doesn't mean it's going to be the same after the reboot.

2) utils.get_ceph_partitions [1] should check that the target node is up (booting the OS until networking is ready takes some time) instead of blindly reporting a "No such partition" error; a sketch of such a retry follows below.

[1] https://github.com/openstack/fuel-qa/blob/d9d8c524ef43409c0d8a2199a1fa8dfddaa500d5/fuelweb_test/helpers/utils.py#L978-L984
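
A minimal sketch of the kind of retry that points 1) and 2) suggest; call_with_retries is a hypothetical helper, and the call shown in the usage comment is illustrative (the exact get_ceph_partitions signature may differ):

import time

def call_with_retries(fn, timeout=300, interval=10):
    # Keep calling fn() until it stops raising or the timeout expires, so a
    # node that is still booting does not immediately turn into a
    # "No such partition" failure.
    deadline = time.time() + timeout
    while True:
        try:
            return fn()
        except Exception:
            if time.time() >= deadline:
                raise
            time.sleep(interval)

# Illustrative usage:
# partitions = call_with_retries(
#     lambda: utils.get_ceph_partitions(node_ip, '/dev/vdb'))

Resolving the drive through a stable identifier (e.g. a /dev/disk/by-path symlink) rather than /dev/vdb could also address point 1).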

Alexei Sheplyakov (asheplyakov) wrote :

3) OSD startup takes some time, so a few down OSDs just after a reboot are perfectly fine.
   helpers.ceph.check_disks [1] should therefore retry instead of immediately bailing out; a sketch of such a retry follows below.

[1] https://github.com/openstack/fuel-qa/blob/36f7965045329eab47e089fe67207dc39e92056e/fuelweb_test/helpers/ceph.py#L117-L135
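
A minimal sketch of such a retry; run_remote() is a placeholder for whatever transport fuel-qa uses to execute commands on a controller node, and the JSON layout is the same 'ceph osd tree -f json' format quoted in the description:

import json
import time

def wait_for_all_osds_up(run_remote, timeout=600, interval=15):
    # Poll 'ceph osd tree -f json' until every OSD reports "up" or the
    # timeout expires, instead of asserting right after the reboot.
    deadline = time.time() + timeout
    while True:
        tree = json.loads(run_remote('ceph osd tree -f json'))
        down = [n['name'] for n in tree['nodes']
                if n.get('type') == 'osd' and n.get('status') != 'up']
        if not down:
            return
        if time.time() >= deadline:
            raise AssertionError(
                'OSDs still down after {0}s: {1}'.format(timeout, down))
        time.sleep(interval)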

Changed in fuel:
assignee: MOS Ceph (mos-ceph) → nobody
Alexei Sheplyakov (asheplyakov) wrote :

The error is in the fuel-qa code (see comments #10 and #11) and has nothing to do with Ceph itself.

Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
assignee: Fuel QA Team (fuel-qa) → MOS Ceph (mos-ceph)
Oleksiy Molchanov (omolchanov) wrote :

Alexei, please check my comment #2.

osd-2 will never start: there is no process for it in ps; it simply died.
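
A hedged sketch of that kind of check; remote_exec(cmd) -> (exit_code, stdout) is a placeholder transport, and the '-i <id>' pattern is an assumption about how the ceph-osd daemon appears in the process list:

def osd_process_running(remote_exec, osd_id):
    # True if a ceph-osd process for the given OSD id shows up on the node.
    code, _ = remote_exec("pgrep -af 'ceph-osd.* -i {0}( |$)'".format(osd_id))
    return code == 0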

Dmitry Belyaninov (dbelyaninov) wrote :
Kostiantyn Danylov (kdanylov) wrote :

QA team: it's not clear from the logs what is happening. Please take into account that it is VERY unlikely that an OSD would not start after a node reboot without an external reason. Please check that the node is in the expected state after the reboot. If you are sure that this is a Ceph issue, please provide a clean list of steps and a test environment.

Changed in fuel:
status: Confirmed → Won't Fix
Kostiantyn Danylov (kdanylov) wrote :

Making it 'Won't Fix' in 9.2 for now.
