[systest] We should sync time if ceph command fails in test deploy_ceph_ha

Bug #1336233 reported by Tatyanka on 2014-07-01
This bug affects 2 people
Affects              Status   Importance   Assigned to       Milestone
Fuel for OpenStack            Medium       Artem Panchenko
4.1.x                         Medium       Artem Panchenko
5.0.x                         Medium       Artem Panchenko

Bug Description

Test http://jenkins-product.srt.mirantis.net:8080/view/5.0_swarm/job/5.0_fuelmain.system_test.centos.thread_3/29/testReport/junit/%28root%29/ceph_ha/ceph_ha/ failed with an unclear waiting-timeout error:

....
....
....
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 95, in wait
    raise TimeoutError("Waiting timed out")
TimeoutError: Waiting timed out

Several things should be done:
1. Add an error message to the ceph_health check in case of failure (we can use stderr to find out what happened).
2. In this case ceph health was HEALTH_WARN because the clock on mon-0 was ahead of the other monitors; after syncing time, ceph health is fine on all nodes. So we need to add a workaround here, like:
try:
    wait(....)
except TimeoutError:
    if 'clocks not synchronized' in the output of `ceph -w`:
        run the time sync command (method) and verify health again
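A minimal runnable sketch of the proposed retry logic. The `wait` helper and `TimeoutError` below are simplified local stand-ins for `devops.helpers.helpers.wait`, and the three callables passed in (`check_health`, `get_status`, `sync_time`) are hypothetical hooks, not functions from the actual test suite:

```python
import time


class TimeoutError(Exception):
    # Stand-in for devops' TimeoutError from the traceback above
    pass


def wait(predicate, timeout=300, interval=5):
    # Simplified stand-in for devops.helpers.helpers.wait
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError("Waiting timed out")


def wait_ceph_health_ok(check_health, get_status, sync_time,
                        timeout=300, interval=5):
    """Wait for HEALTH_OK; on a clock-skew timeout, resync clocks and retry.

    check_health -- callable returning True when Ceph reports HEALTH_OK
    get_status   -- callable returning the text of `ceph health detail`
    sync_time    -- callable that synchronizes clocks on all nodes
    """
    try:
        wait(check_health, timeout=timeout, interval=interval)
    except TimeoutError:
        # Only retry when the warning is actually about clock skew
        if 'clock skew detected' not in get_status():
            raise
        sync_time()
        wait(check_health, timeout=timeout, interval=interval)
```

The key design point is that the remediation runs only when the failure is identifiably clock skew; any other timeout is re-raised so real failures are not hidden.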

Below we can see why ceph health is HEALTH_WARN:
[root@node-4 ~]# ceph -w
  cluster 7cec7fda-d7ca-4836-b954-55373a0e95f3
   health HEALTH_WARN clock skew detected on mon.node-2, mon.node-4
   monmap e3: 3 mons at {node-1=10.108.17.3:6789/0,node-2=10.108.17.4:6789/0,node-4=10.108.17.6:6789/0}, election epoch 16, quorum 0,1,2 node-1,node-2,node-4
   osdmap e65: 12 osds: 12 up, 12 in
    pgmap v104: 3264 pgs: 3264 active+clean; 14464 KB data, 25085 MB used, 568 GB / 592 GB avail
   mdsmap e1: 0/0/1 up

2014-07-01 03:15:57.938597 mon.1 [WRN] message from mon.0 was stamped 0.231240s in the future, clocks not synchronized

^C[root@node-4 ~]# exit
logout
Connection to node-4 closed.
[root@nailgun ~]# ssh node-1
Warning: Permanently added 'node-1' (RSA) to the list of known hosts.
Last login: Tue Jul 1 03:18:29 2014 from 10.108.15.2
[root@node-1 ~]# sudo hwclock --show
Tue 01 Jul 2014 11:19:28 AM UTC -0.624457 seconds
[root@node-1 ~]# date
Tue Jul 1 03:23:01 UTC 2014
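For context, the ~0.23 s offset reported by mon.1 above exceeds Ceph's default monitor clock-drift tolerance (`mon_clock_drift_allowed`, 0.05 s by default), which is what triggers the HEALTH_WARN. The threshold could be widened in ceph.conf, though that only masks skew rather than fixing it; an illustrative fragment (the 0.3 value is an arbitrary example, not a recommendation):

```ini
[mon]
mon clock drift allowed = 0.3
```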

Changed in fuel:
status: New → Confirmed
Dmitry Ilyin (idv1985) on 2014-07-15
summary: - [System tests] We should sync time if ceph is command fails in test
+ [systest] We should sync time if ceph is command fails in test
deploy_ceph_ha

Fix proposed to branch: master
Review: https://review.openstack.org/108434

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
status: Confirmed → In Progress
Artem Panchenko (apanchenko-8) wrote :

Unfortunately, syncing time on all nodes doesn't resolve the problem: 'ceph health detail' still returns a warning and reports clock skew even after sleep(1800). But if the Ceph service is restarted after executing ntpdate, its health status becomes 'HEALTH_OK'.

Dmitry Borodaenko (angdraug) wrote :

Ceph is supposed to notice that clock skew was fixed, but it may take some time (~30s) for it to start reporting HEALTH_OK.
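The sync-then-restart approach described in the two comments above can be sketched as follows. This is not the actual committed patch: `run_on_node` is a hypothetical SSH helper, the node list and NTP server are placeholders, and `service ceph restart` assumes the sysvinit-style CentOS deployment seen in the transcript:

```python
import time


def health_is_ok(health_output):
    # `ceph health` prints a line starting with HEALTH_OK / HEALTH_WARN / HEALTH_ERR
    return health_output.strip().startswith('HEALTH_OK')


def fix_clock_skew_and_wait(nodes, run_on_node, ntp_server, timeout=30):
    """Resync clocks, restart Ceph, then poll `ceph health` until HEALTH_OK.

    nodes       -- controller hostnames, e.g. ['node-1', 'node-2', 'node-4']
    run_on_node -- hypothetical helper: run_on_node(node, cmd) -> stdout string
    ntp_server  -- NTP server address to sync against
    """
    for node in nodes:
        run_on_node(node, 'ntpdate -u {0}'.format(ntp_server))
        # sysvinit-era Ceph; systemd deployments use different unit names
        run_on_node(node, 'service ceph restart')
    # Per the comment above, Ceph may need ~30 s to stop reporting skew
    deadline = time.time() + timeout
    while time.time() < deadline:
        if health_is_ok(run_on_node(nodes[0], 'ceph health')):
            return True
        time.sleep(5)
    return False
```

The timeout defaults to 30 seconds to match the observation that Ceph takes roughly that long to notice the skew is gone.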

Reviewed: https://review.openstack.org/108434
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=7d8a35fdff32146306c11ff8b76c1b63cbb08a80
Submitter: Jenkins
Branch: master

commit 7d8a35fdff32146306c11ff8b76c1b63cbb08a80
Author: Artem Panchenko <email address hidden>
Date: Mon Jul 21 19:48:13 2014 +0300

    Sync time on nodes before checking Ceph health

    System test 'ceph_ha' almost always fails while
    checking ceph health due to clock skew. That's
    why an additional time synchronization and restart
    of Ceph service was added to the test.

    Change-Id: I58af30d29399c70b9f5d142d3c495710f05994f2
    Closes-bug: #1336233

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/108448
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=c5d74f393055f4dadd732cb8d84fd4198ccffcc9
Submitter: Jenkins
Branch: stable/5.0

commit c5d74f393055f4dadd732cb8d84fd4198ccffcc9
Author: Artem Panchenko <email address hidden>
Date: Mon Jul 21 20:28:24 2014 +0300

    Sync time on nodes before checking Ceph health

    System test 'ceph_ha' almost always fails while
    checking ceph health due to clock skew. That's
    why an additional time synchronization and restart
    of Ceph service was added to the test.

    Change-Id: Ic6397e878fdca747f33e47f05e846d25f061ce7b
    Closes-bug: #1336233

Dennis Dmitriev (ddmitriev) wrote :

Fix Released:
api: '1.0'
astute_sha: 6db5f5031b74e67b92fcac1f7998eaa296d68025
build_id: 2014-07-25_00-31-14
build_number: '146'
fuellib_sha: 19acf67997492580beede54d54895c876e1f3f18
fuelmain_sha: 9aa2e3f4d60337a4fa75b0dd9e1c904dc1221102
mirantis: 'yes'
nailgun_sha: 17444180b7e8c0c454488e63a05693881168f76a
ostf_sha: 09b6bccf7d476771ac859bb3c76c9ebec9da9e1f
production: docker
release: 5.0.1

Dennis Dmitriev (ddmitriev) wrote :

Fix released:
api: '1.0'
astute_sha: fd9b8e3b6f59b2727b1b037054f10e0dd7bd37f1
auth_required: false
build_id: 2014-07-24_02-01-14
build_number: '351'
feature_groups:
- mirantis
fuellib_sha: 8bffb2a4723109614aeaabaabffa3c94a1b72705
fuelmain_sha: 103ce9abd6e2632ec1029d1aa3e918517417cba3
nailgun_sha: 744e17cc03207c46ecd79f4ac78fde98f75aec2f
ostf_sha: 81b019a502711dfcd935a981b292e88eb956b141
production: docker
release: '5.1'

Changed in fuel:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/109924
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=fa218a36d2686de6bb36ef8c6b33526c9d802e34
Submitter: Jenkins
Branch: stable/4.1

commit fa218a36d2686de6bb36ef8c6b33526c9d802e34
Author: Artem Panchenko <email address hidden>
Date: Mon Jul 28 11:38:17 2014 +0300

    Sync time on nodes before checking Ceph health

    System test 'ceph_ha' almost always fails while
    checking ceph health due to clock skew. That's
    why an additional time synchronization and restart
    of Ceph service was added to the test.

    Change-Id: I4f493bf92a2ab28b4833cdcb46ed3a8a9a92fc0f
    Closes-bug: #1336233
