[system-tests] Time not synced properly in system tests

Bug #1393473 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Committed | High | Unassigned |
5.1.x | Won't Fix | High | Fuel QA Team |
6.0.x | Won't Fix | High | Fuel QA Team |
6.1.x | Fix Released | High | Unassigned |

Bug Description

{

    "build_id": "2014-11-16_21-00-23",
    "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
    "build_number": "17",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "2fc6fc4261092a591779a8fb7e3fb1623c6abb85",
    "production": "docker",
    "fuelmain_sha": "b118fa4475833ce031ef189ce280772c676fa1c9",
    "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1.1",
    "release_versions": {
        "2014.1.3-5.1.1": {
            "VERSION": {
                "build_id": "2014-11-16_21-00-23",
                "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
                "build_number": "17",
                "api": "1.0",
                "nailgun_sha": "2fc6fc4261092a591779a8fb7e3fb1623c6abb85",
                "production": "docker",
                "fuelmain_sha": "b118fa4475833ce031ef189ce280772c676fa1c9",
                "astute_sha": "702af3db6f5bca92525bc8322d7d5d7675ec857e",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "5.1.1",
                "fuellib_sha": "0d3909b9a291880af28dbe48b9c7d25215aa98ea"
            }
        }
    },
    "fuellib_sha": "0d3909b9a291880af28dbe48b9c7d25215aa98ea"

}

Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/case.py", line 331, in run
    testMethod()
  File "/usr/lib/python2.7/unittest/case.py", line 1043, in runTest
    self._testFunc()
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/usr/lib/python2.7/dist-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/helpers/decorators.py", line 52, in wrapper
    return func(*args, **kwagrs)
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/tests/tests_strength/test_restart.py", line 114, in ceph_ha_restart
    self.fuel_web.check_ceph_status(cluster_id)
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.ubuntu.thread_3/fuelweb_test/models/fuel_web_client.py", line 1095, in check_ceph_status
    interval=30, timeout=recovery_timeout)
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 97, in wait
    raise TimeoutError("Waiting timed out")
TimeoutError: Waiting timed out

After reverting and ssh'ing to slave-01 (node-1), 'ceph health' showed 'Clock skew detected'. It was OK after a couple of minutes.
It seems that our time sync didn't help.

Logs are attached

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Our time sync definitely helps, but it runs some time after the 'ceph' service has already started and detected a clock skew.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Andrey Sledzinskiy, you are right: ntpd on the master was in a de-synced state for some time (about an hour):

[root@nailgun ~]# ntptime
ntp_gettime() returns code 5 (ERROR)
  time d814ba4b.636b3b94 Mon, Nov 17 2014 18:17:47.388, (.388355681),
  maximum error 16000000 us, estimated error 16000000 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
  modes 0x0 (),
  offset 0.000 us, frequency -500.000 ppm, interval 1 s,
  maximum error 16000000 us, estimated error 16000000 us,
  status 0x2041 (PLL,UNSYNC,NANO),
  time constant 6, precision 0.001 us, tolerance 500 ppm,

[root@nailgun ~]# ntpq -p
     remote refid st t when poll reach delay offset jitter
==============================================================================
+ns.aksinet.net 195.2.64.5 2 u 18 64 377 22.068 330.121 130.682
*195.91.239.8 .PPS. 1 u 17 64 377 15.440 331.223 129.450
+cello.corbina.n 192.36.144.22 2 u 14 64 377 23.976 298.865 104.554
 LOCAL(0) .LOCL. 10 l 642 64 0 0.000 0.000 0.000

[root@nailgun ~]# ntpq -c assoc

ind assid status conf reach auth condition last_event cnt
===========================================================
  1 44432 945a yes yes none candidate sys_peer 5
  2 44433 961a yes yes none sys.peer sys_peer 1
  3 44434 9414 yes yes none candidate reachable 1
  4 44435 8023 yes no none reject unreachable 2

It was synchronized later:

[root@nailgun ~]# ntptime
ntp_gettime() returns code 0 (OK)
  time d814c205.8fa1f47c Mon, Nov 17 2014 18:50:45.561, (.561065380),
  maximum error 173659 us, estimated error 21622 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 23017.206 us, frequency 42.672 ppm, interval 1 s,
  maximum error 173659 us, estimated error 21622 us,
  status 0x2001 (PLL,NANO),
  time constant 6, precision 0.001 us, tolerance 500 ppm,

We need to reconsider our time sync strategy.
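One way to tell whether ntpd has actually synchronized is to look for the sys.peer marker in the `ntpq -p` peer listing: ntpq prefixes the currently selected system peer with '*', and its absence (as in the LOCAL(0)-only degraded state) means the daemon is not synced. The helper below is our own illustrative sketch, not part of fuel-qa; it just parses the listing shown above.

```python
# Hypothetical helper (not part of fuel-qa): decide whether ntpd has
# selected a system peer by parsing `ntpq -pn` output. In the peer
# table a leading '*' marks the current sys.peer; if no line starts
# with '*', the daemon has not (yet) synchronized.

def has_sys_peer(ntpq_output):
    """Return True if the `ntpq -p` listing contains a '*' sys.peer line."""
    for line in ntpq_output.splitlines():
        if line.startswith('*'):
            return True
    return False

# Peer listing taken from the comment above.
example = """\
     remote refid st t when poll reach delay offset jitter
==============================================================================
+ns.aksinet.net 195.2.64.5 2 u 18 64 377 22.068 330.121 130.682
*195.91.239.8 .PPS. 1 u 17 64 377 15.440 331.223 129.450
+cello.corbina.n 192.36.144.22 2 u 14 64 377 23.976 298.865 104.554
 LOCAL(0) .LOCL. 10 l 642 64 0 0.000 0.000 0.000
"""

print(has_sys_peer(example))  # → True
```

A test could call this with only the LOCAL(0) fallback line to confirm it reports the de-synced state.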

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The same issue occurs in the destroy_controllers test.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Time was also not synced on nodes in destructive tests; as a result, the RabbitMQ cluster was missing one node and OSTF failed. After I synced time manually, all the tests passed.

Changed in fuel:
status: New → Triaged
summary: - [System tests] Ceph_ha_restart test failed because of time skew detected
+ [System tests] Time not synced properly in system tests
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote : Re: [System tests] Time not synced properly in system tests

The issue also occurs on 6.1 CI.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on Ubuntu BVT_2 : http://jenkins-product.srt.mirantis.net:8080/view/6.1/job/6.1.ubuntu.bvt_2/96/

root@node-2:~# ceph status
    cluster 0fd469bf-0f2e-4d5d-8b3c-8229980020c5
     health HEALTH_WARN clock skew detected on mon.node-6
     monmap e3: 3 mons at {node-1=10.108.22.3:6789/0,node-2=10.108.22.4:6789/0,node-6=10.108.22.8:6789/0}, election epoch 16, quorum 0,1,2 node-1,node-2,node-6
     osdmap e75: 12 osds: 12 up, 12 in
      pgmap v166: 3264 pgs, 6 pools, 12859 kB data, 5 objects
            25112 MB used, 568 GB / 592 GB avail
                3264 active+clean
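A test could detect this condition by scanning the `ceph status` output for the clock-skew warning and pulling out the affected monitors. This is a hypothetical checker of our own (names are ours, not fuel-qa's), shown against the health line quoted above:

```python
import re

# Hypothetical checker (not the actual fuel-qa code): scan `ceph status`
# output for a clock-skew HEALTH_WARN and extract the affected monitors.

def clock_skew_mons(ceph_status_output):
    """Return the list of 'mon.*' names mentioned in a clock-skew warning."""
    for line in ceph_status_output.splitlines():
        if 'clock skew detected' in line:
            return re.findall(r'mon\.\S+', line)
    return []

# Health line taken from the `ceph status` output above.
status = "     health HEALTH_WARN clock skew detected on mon.node-6\n"
print(clock_skew_mons(status))  # → ['mon.node-6']
```

An empty result for a HEALTH_OK cluster lets the caller skip any re-sync step.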

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Once we added Ceph deployment to BVT, time sync became a critical issue.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I suggest the following steps for Ceph system tests, to be done around each snapshot/revert:

Before freezing the VMs and making snapshots:
* stop all Ceph services on all nodes (gracefully)

After the snapshots have been reverted and the VMs unfrozen:
* ensure time is in sync
* start all Ceph services
* wait until 'ceph health' reports OK on all nodes
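The final "wait until healthy" step above can be sketched as a generic polling helper, in the spirit of the devops.helpers.wait() call that timed out in the traceback. The function name, defaults, and injectable clock/sleep hooks below are our own assumptions, not the actual fuel-qa or fuel-devops code:

```python
import time

# Sketch of a "wait for ceph health OK" step. `predicate` would be a
# callable returning True once every node reports HEALTH_OK. The clock
# and sleep hooks are injectable so the loop can be tested without
# real waiting; this mirrors, but is not, devops.helpers.wait().

def wait_for(predicate, timeout=300, interval=10,
             clock=time.time, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while clock() < deadline:
        if predicate():
            return True
        sleep(interval)
    raise RuntimeError("Waiting timed out")
```

For Ceph, `predicate` could shell out to `ceph health` over SSH and check for HEALTH_OK; on timeout the test fails exactly as in the traceback above.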

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Perhaps stopping the services could be replaced with some no-op (maintenance) mode for Ceph as well. I guess the fix should be elaborated with Ceph experts.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Dennis, Bogdan,

this bug is about a completely different issue in another system test, 'ceph_ha_restart'. In that test we revert a snapshot and then check Ceph health:

https://github.com/stackforge/fuel-main/blob/master/fuelweb_test/tests/tests_strength/test_restart.py#L128

You are talking about the 'ceph_ha' test (part of the 'bvt_2' test group), where we perform a Ceph health check right after cluster deployment:

https://github.com/stackforge/fuel-main/blob/master/fuelweb_test/tests/test_ceph.py#L224

Here is the issue that should be fixed to make 6.1.ubuntu.bvt_2 green: https://bugs.launchpad.net/fuel/+bug/1417922

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Artem, thank you for clarifying this!

summary: - [System tests] Time not synced properly in system tests
+ [system-tests] Time not synced properly in system tests
tags: added: non-release
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/185594

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/185594
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=8d30e4260b6436cae1e152680377b3929516c430
Submitter: Jenkins
Branch: master

commit 8d30e4260b6436cae1e152680377b3929516c430
Author: Ivan Kliuk <email address hidden>
Date: Tue May 26 13:13:05 2015 +0000

    Change time check and sync procedure on Ceph nodes

      Add fuelweb_test.checkers.ceph module which aggregates all Ceph related
      handlers and checkers.
      Additional logic is provided to check time consistency of Ceph cluster.
      Before checking Ceph health status, the test checks a time skew and tries to
      sync time on inconsistent nodes.
      Fix time logging format in ntp.py.

    Change-Id: I5f31785fa4d9b3adc9f1f0354bc24431745f115a
    Closes-Bug: #1393473
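The skew check the commit message describes, i.e. comparing node clocks before the Ceph health check and re-syncing only the inconsistent ones, might look roughly like this. Function and parameter names are ours, not the actual fuel-qa code:

```python
# Minimal sketch of the skew check described in the commit message
# (names and the 0.5 s threshold are our assumptions, not fuel-qa's):
# compare each node's clock against a reference time and return the
# nodes whose deviation exceeds the allowed threshold.

def skewed_nodes(node_times, reference, max_skew=0.5):
    """node_times: {node_name: unix_timestamp}; returns names to re-sync."""
    return sorted(
        name for name, t in node_times.items()
        if abs(t - reference) > max_skew
    )

# node-6 is 12.7 s ahead, matching the mon.node-6 skew seen in the logs.
times = {'node-1': 1000.0, 'node-2': 1000.2, 'node-6': 1012.7}
print(skewed_nodes(times, reference=1000.0))  # → ['node-6']
```

The test would then sync time on only the returned nodes before retrying the Ceph health check.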

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

MOS5.1 and MOS6.0 are no longer supported, moving to Won't Fix.
