Cannot sync time: Failed to wait peer on node

Bug #1656020 reported by Sergey Novikov
22
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
In Progress
High
Fuel QA Team
Mitaka
Fix Released
High
Dmitry Ilyin
Newton
Confirmed
High
Fuel QA Team
Ocata
Confirmed
High
Fuel QA Team

Bug Description

Detailed bug description:

during time sync by devops the following exception is raised

2017-01-12 04:21:34 - ERROR decorators.py:126 -- Traceback (most recent call last):
  File "/home/jenkins/workspace/9.x.system_test.ubuntu.cgroup_ha/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.x.system_test.ubuntu.cgroup_ha/fuelweb_test/tests/test_cgroups.py", line 479, in apply_cgroups_reboot_node
    self.fuel_web.cold_restart_nodes([target_controller])
  File "/home/jenkins/workspace/9.x.system_test.ubuntu.cgroup_ha/fuelweb_test/models/fuel_web_client.py", line 2219, in cold_restart_nodes
    self.environment.sync_time()
  File "/home/jenkins/workspace/9.x.system_test.ubuntu.cgroup_ha/fuelweb_test/models/environment.py", line 147, in sync_time
    new_time = sync_time(self.d_env, nodes_names, skip_sync)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/retry.py", line 27, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 43, in sync_time
    g_ntp.do_sync_time(g_ntp.pacemaker_ntps)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 340, in do_sync_time
    ntp.wait_peer()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 158, in wait_peer
    self._node_name))
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 111, in wait
    raise TimeoutError(timeout_msg)
TimeoutError: Failed to wait peer on node u'slave-01'

The issue was found by
https://product-heci.infra.mirantis.net/job/9.x.system_test.ubuntu.cgroup_ha/170/testReport/(root)/apply_cgroups_reboot_node/
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ready_node_reinstallation/167/testReport/

this behavior was observed on srv88-bud.infra.mirantis.net and srv49-bud.infra.mirantis.net

Steps to reproduce:
1. deploy env with 3 controllers
2. find controller node with
root@node-1:~# ps axu | grep ntp
ntp 26223 0.0 0.0 27100 3612 ? Ss 15:45 0:00 /usr/sbin/ntpd -u ntp:ntp -p /var/run/ntpd.pid -4 -g -c /etc/ntp.conf
ntp 28364 0.0 0.1 27224 4444 ? Ss 15:46 0:00 ntpd -n -g -u 103:107

3. migrate all vips psc resources from node(vip__management, vip__vrouter_pub, vip__vrouter, vip__public)

4. stop/start ntpd
# crm resource stop p_ntp
# crm resource start p_ntp

5. execute
# ip netns exec vrouter ntpq -pn 127.0.0.1

Expected results:
 > remote refid st t when poll reach delay offset jitter
 > ==============================================================================
 > *10.109.8.1 81.2.254.224 3 u 7 8 377 0.193 -0.032 0.053
Actual result:
 > remote refid st t when poll reach delay offset jitter
 > ==============================================================================
 > 10.109.8.1 .INIT. 16 - - 8 0 0.000 0.000 0.000

So, it's a strange bug...This behavior is observed for only controller node with 2 ntpd processes.
The process
"/usr/sbin/ntpd -u ntp:ntp -p /var/run/ntpd.pid -4 -g -c /etc/ntp.conf" is managed by pacemaker and
process "ntpd -n -g -u 103:107" is initialized by upstart, as I understood.

Revision history for this message
Sergey Novikov (snovikov) wrote :
Changed in fuel:
assignee: nobody → Stanislaw Bogatkin (sbogatkin)
Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Oleksiy Molchanov (omolchanov)
assignee: Oleksiy Molchanov (omolchanov) → Stanislaw Bogatkin (sbogatkin)
Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Our ntp package contain ntp upstart and systemV scripts, but original package doesn't contain upstart. This breaks our ntp-pacemaker deployment.

http://packages.ubuntu.com/ru/trusty/amd64/ntp/filelist

root@node-6:~# dpkg -S /etc/init.d/ntp
ntp: /etc/init.d/ntp

root@node-6:~# dpkg -S /etc/init/ntp.conf
ntp: /etc/init/ntp.conf

root@node-6:~# apt-cache policy ntp
ntp:
 Installed: 2:4.2.6.p5+dfsg-3~u14.04+mos1
 Candidate: 2:4.2.6.p5+dfsg-3~u14.04+mos1
 Version table:
*** 2:4.2.6.p5+dfsg-3~u14.04+mos1 0
      1200 http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-latest/ mos9.0-proposed/main amd64 Packages
      1050 http://mirror.fuel-infra.org/mos-repos/ubuntu/9.0/ mos9.0-updates/main amd64 Packages
       100 /var/lib/dpkg/status
    1:4.2.6.p5+dfsg-3ubuntu2.14.04.10 0
       500 http://archive.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
       500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Linux (mos-linux)
Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Upstart script was added to fix a bug [1].
If upstart script doesn't needed - just disable it with .override file.

[1] https://bugs.launchpad.net/fuel/+bug/1585751

Changed in fuel:
assignee: MOS Linux (mos-linux) → Oleksiy Molchanov (omolchanov)
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Undecided → High
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Dmitry Ilyin (idv1985)
tags: added: area-library
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Since amount of failed cases tag "swarm-blocker" was added.

tags: added: swarm-blocker
removed: swarm-fail
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

Got this bug on srv20; there is no "vr-host-ns@if19" in "vrouter" namespace right after deployment:

node-1 - no ntp available http://paste.openstack.org/show/595163/
node-5 - non-primary node, ntp available http://paste.openstack.org/show/595164/

BUT after "crm resource migrate vip__vrouter node-1.test.domain.local" it was created and ntp functionality restored:
http://paste.openstack.org/show/595165/

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/ntp (9.0)

Fix proposed to branch: 9.0
Change author: Vladimir Kuklin <email address hidden>
Review: https://review.fuel-infra.org/29900

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote : Re: [9.x swarm] Cannot sync time: Failed to wait peer on node
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/ntp (9.0)

Fix proposed to branch: 9.0
Change author: Dmitry Teselkin <email address hidden>
Review: https://review.fuel-infra.org/29901

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/ntp (9.0)

Reviewed: https://review.fuel-infra.org/29901
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 94072570a6571da43eda1250d7e3ea251b006797
Author: Dmitry Teselkin <email address hidden>
Date: Tue Jan 17 16:01:17 2017

Remove init.d script

Remove init.d script and replace it with compatibility symlink to upstart.

Change-Id: Ib52ba1b6c9855439919131de7152cc8d5f9875ac
Closes-bug: #1656020

Revision history for this message
Nastya Urlapova (aurlapova) wrote : Re: [9.x swarm] Cannot sync time: Failed to wait peer on node
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

There is no 'vr-host-ns@*' interface in 'vrouter' namespace on node-1 and wasn't created during vip migration.
It was found that 'vr-host-ns' interface in 'vrouter' namespace wasn't created after a reboot and the issues is likely in ocf scripts.

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

@Dmitry, I have opened new bug for this - https://bugs.launchpad.net/fuel/+bug/1657745

This patch should fix reported issue - https://review.openstack.org/#/q/I0118a3bf33a7450b7dc92a65efd2977b7d9eebda,n,z

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

There are no described fail on snapshot #796 SWARM

Revision history for this message
Vitalii Gridnev (vgridnev) wrote :
Revision history for this message
Vitalii Gridnev (vgridnev) wrote :
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Alexander Ignatov (aignatov) wrote :
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
summary: - [9.x swarm] Cannot sync time: Failed to wait peer on node
+ Cannot sync time: Failed to wait peer on node
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

This issue is related to pacemaker, it is out of cluster on failed node with message in logs

Mar 15 05:37:34 [19858] node-2.test.domain.local pengine: notice: cluster_status: We do not have quorum - fencing and resource management disabled

based on logs from comment #27

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

If we look into last snapshot logs, we will see that there was a server reached:

> *81.2.248.189 195.113.144.238 2 u 75 64 102 7.526 -0.880 0.874\n

but there are problem with it - reach value is 102, which means that it is unstable. All last values are: 241, 120, 250, 124, 52, 25 - and all of them are unstable, which means that this server is a bit of problem, but it still can be used as reference. I suppose that we should in long term improve our algorithm to detect problematic servers (for example, pass unstable references if there are more than one second references) but in short term we can just remove requirement to discard unstable one, as it still used for time sync.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-devops (master)

Fix proposed to branch: master
Review: https://review.openstack.org/450796

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Download full text (3.3 KiB)

This bug (doubling ntpd service) and it's consequences (flapping ntp peers) is caused because of issue in the testcases.

After rebooting the node with pacemaker, testcase *MUST* use the following method to ensure that all pacemaker resources were started after the reboot:

self.fuel_web.assert_ha_services_ready(cluster_id)

Please add this method to all cases where deployed controllers are rebooted.
This is required *NOT* only for time sync. Working pacemaker resources are mandatory for further cluster functionality, test case should not continue until all necessary resources are considered as started.

--------------------
Details:

1. ntp.py from fuel-devops tries to detect how to control the "ntp" service: via pacemaker, via 'service' command or via 'systemctl' command.

2. Right after reboot of the node, pacemaker resources are not started, so the pacemaker check is failed (see an example below):
2017-03-30 04:04:15 - DEBUG - ssh_client.py:868 -- 'ps -C pacemakerd && crm_resource --resource p_ntp --locate' execution results:
Exit code: EX_ERROR<1(0x01)>

3. ntp.py found and init script in /etc/init.d/ntp.* and assumed that the "ntp" is controlled with this script.

4. After ntp.py starts 'ntp' service with init script, pacemaker is awaking and starts a second instance of 'ntp' service in namespace.

Here is an example what is going on after the reboot:
=====================================================
root@node-4:~# ps -C pacemakerd
  PID TTY TIME CMD
 5053 ? 00:00:00 pacemakerd

root@node-4:~# crm_resource --list
 ...
 Clone Set: clone_p_ntp [p_ntp]

root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate
Error performing operation: No such device or address
root@node-4:~# crm_resource --resource p_ntp --locate

root@node-4:~# crm_resource --list
 Clone Set: clone_p_ntp [p_ntp]
     Stopped: [ node-4.test.domain.local node-5.test.domain.local ]

root@node-4:~# crm_resource --resource p_ntp --locate
resource p_ntp is NOT running
resource p_ntp is NOT running
resource p_ntp is NOT running
root@node-4:~# crm_resource --resource p_ntp --locate
resource p_ntp is NOT running
resource p_ntp is NOT running
resource p_ntp is NOT running
root@node-4:~# crm_resource --resourc...

Read more...

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

In this case, problem with flapping server will stay but masked by other means. So my fix is related to this bug but testcase problems should be fixed by qa team, so I reassign this bug.

Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Fuel QA Team (fuel-qa)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/451743

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-devops (master)

Reviewed: https://review.openstack.org/450796
Committed: https://git.openstack.org/cgit/openstack/fuel-devops/commit/?id=4da6c0c1116e97f50bca54db4a687717c782531c
Submitter: Jenkins
Branch: master

commit 4da6c0c1116e97f50bca54db4a687717c782531c
Author: Stanislaw Bogatkin <email address hidden>
Date: Tue Mar 28 16:54:04 2017 +0300

    Remove requirement to have strong reachability to remote server

    If remote server was selected as reference, it can be used as time
    source, so there are no big problem if it flaps.

    Change-Id: I01fa7acde7c9cd97245e8b2bf8ca31f3c62d6479
    Related-Bug: #1656020

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/451807

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/451808

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-devops (master)

Reviewed: https://review.openstack.org/451749
Committed: https://git.openstack.org/cgit/openstack/fuel-devops/commit/?id=6d4b22107efca67d62f14d16776503c3e5b3410f
Submitter: Jenkins
Branch: master

commit 6d4b22107efca67d62f14d16776503c3e5b3410f
Author: Dennis Dmitriev <email address hidden>
Date: Thu Mar 30 13:40:34 2017 +0300

    Fix 'ntpd -gq' false-positive exit code 0

    Command 'ntpd -gq' is finished with exit code 0 even if
    "no servers found" appeared.

    Change-Id: Ib45afe7fccca91b402c8f370ede8f963e550921f
    Related-Bug: #1656020

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-qa (stable/newton)

Reviewed: https://review.openstack.org/451743
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=0c0a9359697afa353840f235577e2f6c696e5a94
Submitter: Jenkins
Branch: stable/newton

commit 0c0a9359697afa353840f235577e2f6c696e5a94
Author: Vladimir Khlyunev <email address hidden>
Date: Thu Mar 30 14:21:52 2017 +0400

    Do not sync time after restarting of node

    We have weird bug that can spawn 2 ntpd processes at the same time
    by HA mechanism. Actually we have to sync time strictly only after
    reverting a snapshot.

    Change-Id: Ic73eb942eaacb436629841f9b12df844f3ac736d
    Related-bug:1656020

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Vladimir Khlyunev (<email address hidden>) on branch: master
Review: https://review.openstack.org/451808
Reason: more than 1 year old

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (stable/ocata)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/ocata
Review: https://review.opendev.org/451807
Reason: This repo is retired now, no further work will get merged.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.