Fuel for OpenStack

System tests are failed during time synchronization over the network

Bug #1351279 reported by Dennis Dmitriev on 2014-08-01

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Released	Medium	Artem Panchenko	Fuel for OpenStack 5.1
	5.0.x	Fix Released	Medium	Artem Panchenko	Fuel for OpenStack 5.0.2

Bug Description

In the system tests, after snapshots are reverted a time synchronization over the network is performed.
If there is any issue with connectivity to internet from a node then system test is failed with the following error:

2014-08-01 09:27:45,421 - INFO environment.py:373 -- Waiting for admin node to start up
2014-08-01 09:27:45,423 - INFO environment.py:375 -- Proceed with installation
2014-08-01 09:30:40,859 - INFO environment.py:410 -- Waiting while bootstrapping is in progress
2014-08-01 09:30:40,859 - INFO environment.py:413 -- Puppet timeout set in 6000.0
2014-08-01 09:38:53,688 - INFO environment.py:396 -- Sync time on revert for admin
FAIL
Prepare master node ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Bootstrap 3 slave nodes ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Deploy cluster in simple mode with Neutron GRE ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Bootstrap 5 slave nodes ... SKIP: Failure in <function setup_master at 0x7f38d2272320>
Deploy cluster in HA mode with Neutron VLAN ... SKIP: Failure in <function setup_master at 0x7f38d2272320>

======================================================================
FAIL: Create environment and set up master node
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/user/devops-venv/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/tests/base_test_case.py", line 57, in setup_master
    self.env.setup_environment()
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 382, in setup_environment
    self.sync_time_admin_node()
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 397, in sync_time_admin_node
    self.sync_node_time(self.get_admin_remote())
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/helpers/decorators.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/user/workspace/fuel_master.centos.bvt_1/fuelweb_test/models/environment.py", line 390, in sync_node_time
    assert_equal(0, ntpd_exit)
AssertionError: 0 != 1

----------------------------------------------------------------------
Ran 6 tests in 819.050s

It may be reproduced with creating an environment where 'NAT' is disabled for Fuel 'admin' network.

So, any issue with connectivity to the Internet during the system tests may cause such failure (DNS resolution problem, bad iptables rules at the sync time, network packets loss in virtual env during snapshot restoring, etc).

Dennis Dmitriev (ddmitriev) on 2014-08-01

Changed in fuel:
assignee:	nobody → Fuel QA Team (fuel-qa)

Revision history for this message

Igor Shishkin (teran) wrote on 2014-08-01:

It looks very strange. Please try to sync node time from Hardware Node, that should help in case of network problems.

Changed in fuel:
importance:	Undecided → Medium
importance:	Medium → High

Nastya Urlapova (aurlapova) on 2014-08-01

Changed in fuel:
status:	New → Confirmed
importance:	High → Medium

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-03: Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/111533

Revision history for this message

Artem Panchenko (apanchenko-8) wrote on 2014-08-04:

Igor, I don't think it's related to network problems. I performed some tests with time synchronizing after reverting VMs snapshots and discovered that ntpdate on slaves return 'no server suitable for synchronization found' error, here is its full output:

[root@node-4 ~]# ntpdate -dvu $(awk '/^server/ {print $2}' /etc/ntp.conf)
4 Aug 08:11:42 ntpdate[19763]: ntpdate 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
Looking for host 10.108.0.2 and service ntp
host found : 10.108.0.2
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
10.108.0.2: Server dropped: Server has gone too long without sync
server 10.108.0.2, port 123
stratum 7, precision -23, leap 00, trust 000
refid [10.108.0.2], delay 0.02567, dispersion 0.00000
transmitted 4, in filter 4
reference time: d785d512.0696535e Fri, Aug 1 2014 8:57:54.025
originate timestamp: d789bebe.fca795b1 Mon, Aug 4 2014 8:11:42.986
transmit timestamp: d789bebe.af230efd Mon, Aug 4 2014 8:11:42.684
filter delay: 0.02573 0.02570 0.02570 0.02567
0.00000 0.00000 0.00000 0.00000
filter offset: 0.302768 0.302751 0.302744 0.302761
0.000000 0.000000 0.000000 0.000000
delay 0.02567, dispersion 0.00000
offset 0.302761

4 Aug 08:11:42 ntpdate[19763]: no server suitable for synchronization found

As you can see, it was able to connect to the NTP server on master node but it's 'reference time:' was to old, so it printed "10.108.0.2: Server dropped: Server has gone too long without sync" message. This error is caused because of:

1. in system tests after reverting snapshots we sync system time from HW clock and using ntpdate, but ntpd daemon still has old 'time of last date sync' ( Fri, Aug 1 2014 8:57:54.025)
2. then we sync time on slave using hwclock (Mon, Aug 4 2014 )
3. and execute ntpdate, which connects to NTP daemon, but drops it due to old 'reference time:'

Unfortunately, I didn't find out how often ntpd perform time sync on master node, but empirically I discovered that after 'service ntpd stop; ntpd -qg; service ntpd start; sleep 120' on master time from slaves works fine.

Igor, I don't think it's related to network problems. I performed some tests with time synchronizing after reverting VMs snapshots and discovered that ntpdate on slaves return  'no server suitable for synchronization found' error, here is its full output:

[root@node-4 ~]# ntpdate -dvu $(awk '/^server/ {print $2}' /etc/ntp.conf)
 4 Aug 08:11:42 ntpdate[19763]: ntpdate 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
Looking for host 10.108.0.2 and service ntp
host found : 10.108.0.2
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
transmit(10.108.0.2)
receive(10.108.0.2)
10.108.0.2: Server dropped: Server has gone too long without sync
server 10.108.0.2, port 123
stratum 7, precision -23, leap 00, trust 000
refid [10.108.0.2], delay 0.02567, dispersion 0.00000
transmitted 4, in filter 4
reference time:    d785d512.0696535e  Fri, Aug  1 2014  8:57:54.025
originate timestamp: d789bebe.fca795b1  Mon, Aug  4 2014  8:11:42.986
transmit timestamp:  d789bebe.af230efd  Mon, Aug  4 2014  8:11:42.684
filter delay:  0.02573  0.02570  0.02570  0.02567 
         0.00000  0.00000  0.00000  0.00000 
filter offset: 0.302768 0.302751 0.302744 0.302761
         0.000000 0.000000 0.000000 0.000000
delay 0.02567, dispersion 0.00000
offset 0.302761

4 Aug 08:11:42 ntpdate[19763]: no server suitable for synchronization found

1. in system tests after reverting snapshots we sync system time from HW clock and using ntpdate, but ntpd daemon still has old 'time of last  date sync' ( Fri, Aug  1 2014  8:57:54.025)
2. then we sync time on slave using hwclock (Mon, Aug  4 2014 )
3. and execute ntpdate, which connects to NTP daemon, but drops it due to old 'reference time:'

Revision history for this message

Igor Shishkin (teran) wrote on 2014-08-04:

Artem, that possible reason.
But we should get this error more often in that case.
Besides, environment is completely erasing before system tests, so it doesn't stall much time.

Do we have logs from nodes with that problem on system tests to clarify what really happen inside nodes?

Revision history for this message

Artem Panchenko (apanchenko-8) wrote on 2014-08-04:

Igor, I got that error on local environment while testing patch for Ceph and trying to fix "Clock skew" warning and found out that frequently slaves don't sync time from master node due to error described above. That didn't fail of system tests during time synchronizing, but caused problems with Ceph (clock skew warning). Unfortunately, I don't have logs with enabled debug for ntpdate, so they aren't informative.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-08-04:

Sounds reasonable, perhaps we should use 'service ntpd stop; ntpd -qg; service ntpd start; sleep 120' indeed. There is also a related bug for this as well https://bugs.launchpad.net/fuel/+bug/1314958

Changed in fuel:
status:	Confirmed → Triaged

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-04: Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111723

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-05: Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/111723
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=6abb9a3b2d0ea720ad69b66a94464b3904c4a798
Submitter: Jenkins
Branch: master

commit 6abb9a3b2d0ea720ad69b66a94464b3904c4a798
Author: Artem Panchenko <email address hidden>
Date: Mon Aug 4 15:27:52 2014 +0300

Don't fail tests if can't connect to NTP servers

    1) In case of temporary problems with internet connection on
    host server ndpd/ntpdate on Fuel master can't synchronize
    time with public NTP servers and system tests fail. Actually,
    it isn't necessarily because slaves sync time from master
    node and cluster time still will be identical, so we should
    not interrupt system tests.
    2) Tests should proceed with synchronizing time on slaves
    only if ntpd on master node is ready for connections.
    3) If ntpdate on slaves is unable to sync time with master
    node, then raise exception and break test.

Change-Id: Ia7580767be2594bacbece3d7c6aae6a9aeffa56d
Closes-bug: #1351279

Changed in fuel:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-05: Fix proposed to fuel-main (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/112008

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-05: Fix merged to fuel-main (stable/5.0)

#10

Reviewed: https://review.openstack.org/112008
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=7cd4cc1fb4dc7885163bde045682934f440a00a5
Submitter: Jenkins
Branch: stable/5.0

commit 7cd4cc1fb4dc7885163bde045682934f440a00a5
Author: Artem Panchenko <email address hidden>
Date: Tue Aug 5 16:05:27 2014 +0300

Don't fail tests if can't connect to NTP servers

Change-Id: I81f19bbd2a542d540d53ea8bdb42537b83867ee9
Closes-bug: #1351279

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-16: Change abandoned on fuel-main (master)

#11

Change abandoned by Igor Shishkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/111533

Artem Panchenko (apanchenko-8) on 2014-08-20

Changed in fuel:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.