System tests are failed during time synchronization over the network

Bug #1351279 reported by Dennis Dmitriev
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
Medium
Artem Panchenko
5.0.x
Fix Released
Medium
Artem Panchenko

Bug Description

In the system tests, after snapshots are reverted a time synchronization over the network is performed.
If there is any issue with connectivity to internet from a node then system test is failed with the following error:

2014-08-01 09:27:45,421 - INFO environment.py:373 -- Waiting for admin node to start up
2014-08-01 09:27:45,423 - INFO environment.py:375 -- Proceed with installation
2014-08-01 09:30:40,859 - INFO environment.py:410 -- Waiting while bootstrapping is in progress
2014-08-01 09:30:40,859 - INFO environment.py:413 -- Puppet timeout set in 6000.0
2014-08-01 09:38:53,688 - INFO environment.py:396 -- Sync time on revert for admin
FAIL
Prepare master node ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Bootstrap 3 slave nodes ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Deploy cluster in simple mode with Neutron GRE ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Bootstrap 5 slave nodes ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Deploy cluster in HA mode with Neutron VLAN ... SKIP: Failure in <function setup_master at 0x7f38d2272320>

======================================================================
FAIL: Create environment and set up master node
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/tests/base_test_case.py", line 57, in setup_master
    self.env.setup_environment()
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 382, in setup_environment
    self.sync_time_admin_node()
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 397, in sync_time_admin_node
    self.sync_node_time(self.get_admin_remote())
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/helpers/decorators.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 390, in sync_node_time
    assert_equal(0, ntpd_exit)
AssertionError: 0 != 1

----------------------------------------------------------------------
Ran 6 tests in 819.050s

It may be reproduced with creating an environment where 'NAT' is disabled for Fuel 'admin' network.

So, any issue with connectivity to the Internet during the system tests may cause such failure (DNS resolution problem, bad iptables rules at the sync time, network packets loss in virtual env during snapshot restoring, etc).

Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
Revision history for this message
Igor Shishkin (teran) wrote :

It looks very strange. Please try to sync node time from Hardware Node, that should help in case of network problems.

Changed in fuel:
importance: Undecided → Medium
importance: Medium → High
Changed in fuel:
status: New → Confirmed
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/111533

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Igor, I don't think it's related to network problems. I performed some tests with time synchronizing after reverting VMs snapshots and discovered that ntpdate on slaves return 'no server suitable for synchronization found' error, here is its full output:

[root@node-4 ~]# ntpdate -dvu $(awk '/^server/ {print $2}' /etc/ntp.conf)
 4 Aug 08:11:42 ntpdate[19763]: ntpdate 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
Looking for host 10.108.0.2 and service ntp
host found : 10.108.0.2
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
10.108.0.2: Server dropped: Server has gone too long without sync
server 10.108.0.2, port 123
stratum 7, precision -23, leap 00, trust 000
refid [10.108.0.2], delay 0.02567, dispersion 0.00000
transmitted 4, in filter 4
reference time: d785d512.0696535e Fri, Aug 1 2014 8:57:54.025
originate timestamp: d789bebe.fca795b1 Mon, Aug 4 2014 8:11:42.986
transmit timestamp: d789bebe.af230efd Mon, Aug 4 2014 8:11:42.684
filter delay: 0.02573 0.02570 0.02570 0.02567
         0.00000 0.00000 0.00000 0.00000
filter offset: 0.302768 0.302751 0.302744 0.302761
         0.000000 0.000000 0.000000 0.000000
delay 0.02567, dispersion 0.00000
offset 0.302761

 4 Aug 08:11:42 ntpdate[19763]: no server suitable for synchronization found

As you can see, it was able to connect to the NTP server on master node but it's 'reference time:' was to old, so it printed "10.108.0.2: Server dropped: Server has gone too long without sync" message. This error is caused because of:

1. in system tests after reverting snapshots we sync system time from HW clock and using ntpdate, but ntpd daemon still has old 'time of last date sync' ( Fri, Aug 1 2014 8:57:54.025)
2. then we sync time on slave using hwclock (Mon, Aug 4 2014 )
3. and execute ntpdate, which connects to NTP daemon, but drops it due to old 'reference time:'

Unfortunately, I didn't find out how often ntpd perform time sync on master node, but empirically I discovered that after 'service ntpd stop; ntpd -qg; service ntpd start; sleep 120' on master time from slaves works fine.

Revision history for this message
Igor Shishkin (teran) wrote :

Artem, that possible reason.
But we should get this error more often in that case.
Besides, environment is completely erasing before system tests, so it doesn't stall much time.

Do we have logs from nodes with that problem on system tests to clarify what really happen inside nodes?

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Igor, I got that error on local environment while testing patch for Ceph and trying to fix "Clock skew" warning and found out that frequently slaves don't sync time from master node due to error described above. That didn't fail of system tests during time synchronizing, but caused problems with Ceph (clock skew warning). Unfortunately, I don't have logs with enabled debug for ntpdate, so they aren't informative.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Sounds reasonable, perhaps we should use 'service ntpd stop; ntpd -qg; service ntpd start; sleep 120' indeed. There is also a related bug for this as well https://bugs.launchpad.net/fuel/+bug/1314958

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111723

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/111723
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=6abb9a3b2d0ea720ad69b66a94464b3904c4a798
Submitter: Jenkins
Branch: master

commit 6abb9a3b2d0ea720ad69b66a94464b3904c4a798
Author: Artem Panchenko <email address hidden>
Date: Mon Aug 4 15:27:52 2014 +0300

    Don't fail tests if can't connect to NTP servers

    1) In case of temporary problems with internet connection on
    host server ndpd/ntpdate on Fuel master can't synchronize
    time with public NTP servers and system tests fail. Actually,
    it isn't necessarily because slaves sync time from master
    node and cluster time still will be identical, so we should
    not interrupt system tests.
    2) Tests should proceed with synchronizing time on slaves
    only if ntpd on master node is ready for connections.
    3) If ntpdate on slaves is unable to sync time with master
    node, then raise exception and break test.

    Change-Id: Ia7580767be2594bacbece3d7c6aae6a9aeffa56d
    Closes-bug: #1351279

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/112008

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/5.0)

Reviewed: https://review.openstack.org/112008
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=7cd4cc1fb4dc7885163bde045682934f440a00a5
Submitter: Jenkins
Branch: stable/5.0

commit 7cd4cc1fb4dc7885163bde045682934f440a00a5
Author: Artem Panchenko <email address hidden>
Date: Tue Aug 5 16:05:27 2014 +0300

    Don't fail tests if can't connect to NTP servers

    1) In case of temporary problems with internet connection on
    host server ndpd/ntpdate on Fuel master can't synchronize
    time with public NTP servers and system tests fail. Actually,
    it isn't necessarily because slaves sync time from master
    node and cluster time still will be identical, so we should
    not interrupt system tests.
    2) Tests should proceed with synchronizing time on slaves
    only if ntpd on master node is ready for connections.
    3) If ntpdate on slaves is unable to sync time with master
    node, then raise exception and break test.

    Change-Id: I81f19bbd2a542d540d53ea8bdb42537b83867ee9
    Closes-bug: #1351279

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (master)

Change abandoned by Igor Shishkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/111533

Changed in fuel:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.