So, as we can see in the log, fuel-devops couldn't sync time on one of the nodes, and this node is not a controller.
The difference between syncing time on controllers and on other nodes is the NTP host. For controllers it is a remote host, while for the other nodes it is the NTP server that runs in a network namespace managed by Pacemaker via the vip__vrouter resource.
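A quick way to confirm which upstream a given node is configured to use is to pull the server line out of its ntp.conf. The sketch below runs the check against a sample file under /tmp, since the file contents here are illustrative (the real file on a node would be /etc/ntp.conf, and the 10.109.1.8 address is the vrouter VIP seen in the ntpdate attempt below):

```shell
# Illustrative ntp.conf fragment for a non-controller node: such nodes
# point at the vrouter VIP rather than a remote NTP host.
cat > /tmp/sample_ntp.conf <<'EOF'
server 10.109.1.8 iburst
EOF

# Extract the configured NTP server address from the config
awk '/^server/ {print $2}' /tmp/sample_ntp.conf
```

On a real node the same awk invocation against /etc/ntp.conf shows immediately whether the node depends on the namespace-hosted NTP server.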
I tried to sync time on a compute node:
root@node-5:~# ntpdate -d 10.109.1.8
1 Jun 10:31:29 ntpdate[32556]: ntpdate 4.2.6p5@1.2349-o Thu Feb 11 18:30:41 UTC 2016 (1)
Looking for host 10.109.1.8 and service ntp
host found : 10.109.1.8
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
10.109.1.8: Server dropped: no data
server 10.109.1.8, port 123
stratum 0, precision 0, leap 00, trust 000
refid [10.109.1.8], delay 0.00000, dispersion 64.00000
transmitted 4, in filter 4
reference time: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
originate timestamp: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
transmit timestamp: daf93807.6a5ce4c3 Wed, Jun 1 2016 10:31:35.415
filter delay: 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.00000, dispersion 64.00000
offset 0.000000
1 Jun 10:31:37 ntpdate[32556]: no server suitable for synchronization found
Then I tried to find out what was wrong with NTP in the namespace, on a controller:
root@node-6:~# ip netns exec vrouter netstat -nap|grep :123
So, there is no live NTP instance listening in the namespace.
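The empty netstat output above can be turned into an explicit pass/fail check. This is a minimal sketch: the helper function name is mine, and on the cluster its input would come from `ip netns exec vrouter netstat -nap` (the `vrouter` namespace name is taken from the command above); here it is fed captured text so the failing case is reproducible:

```shell
# Report whether any process is bound to the NTP port (:123) in the
# given netstat output. Empty input mimics what node-6 returned.
check_ntp_listening() {
  # $1: netstat -nap output to inspect
  printf '%s' "$1" | grep -q ':123 ' \
    && echo "ntpd is listening" \
    || echo "no ntpd socket found"
}

check_ntp_listening ""   # mimics the empty output seen on node-6
```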
Pacemaker status:
root@node-6:~# pcs status
Cluster name:
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Wed Jun 1 11:46:43 2016 Last change: Tue May 31 12:27:14 2016 by root via cibadmin on node-2.test.domain.local
Stack: corosync
Current DC: node-6.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured
Online: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
Full list of resources:
Clone Set: clone_p_vrouter [p_vrouter]
Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
vip__public (ocf::fuel:ns_IPaddr2): Stopped
Clone Set: clone_p_haproxy [p_haproxy]
Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
Clone Set: clone_p_mysqld [p_mysqld]
Stopped: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started node-2.test.domain.local
sysinfo_node-6.test.domain.local (ocf::pacemaker:SysInfo): Started node-6.test.domain.local
Master/Slave Set: master_p_conntrackd [p_conntrackd]
Masters: [ node-2.test.domain.local ]
Slaves: [ node-6.test.domain.local ]
Stopped: [ node-1.test.domain.local ]
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
Slaves: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
Clone Set: clone_p_dns [p_dns]
Started: [ node-2.test.domain.local node-6.test.domain.local ]
Stopped: [ node-1.test.domain.local ]
sysinfo_node-1.test.domain.local (ocf::pacemaker:SysInfo): Stopped
Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
Started: [ node-2.test.domain.local node-6.test.domain.local ]
Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
Started: [ node-6.test.domain.local ]
Stopped: [ node-2.test.domain.local ]
Clone Set: clone_p_heat-engine [p_heat-engine]
Stopped: [ node-6.test.domain.local ]
Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
Clone Set: clone_p_ntp [p_ntp]
Clone Set: clone_ping_vip__public [ping_vip__public]
Failed Actions:
* p_mysqld_start_0 on node-1.test.domain.local 'unknown error' (1): call=168, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 13:02:47 2016', queued=0ms, exec=300002ms
* neutron-openvswitch-agent_monitor_20000 on node-6.test.domain.local 'unknown error' (1): call=93, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 12:53:36 2016', queued=0ms, exec=0ms
* p_mysqld_monitor_60000 on node-6.test.domain.local 'unknown error' (1): call=41, status=complete, exitreason='none',
last-rc-change='Tue May 31 12:54:24 2016', queued=0ms, exec=0ms
* neutron-l3-agent_monitor_20000 on node-6.test.domain.local 'unknown error' (1): call=100, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 12:53:30 2016', queued=0ms, exec=0ms
* sysinfo_node-6.test.domain.local_monitor_15000 on node-6.test.domain.local 'unknown error' (1): call=47, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 12:53:28 2016', queued=0ms, exec=0ms
* neutron-openvswitch-agent_monitor_20000 on node-2.test.domain.local 'unknown error' (1): call=96, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 12:54:35 2016', queued=0ms, exec=0ms
* p_mysqld_start_0 on node-2.test.domain.local 'unknown error' (1): call=326, status=Timed Out, exitreason='none',
last-rc-change='Wed Jun 1 11:37:52 2016', queued=0ms, exec=300004ms
* sysinfo_node-2.test.domain.local_monitor_15000 on node-2.test.domain.local 'unknown error' (1): call=47, status=Timed Out, exitreason='none',
last-rc-change='Tue May 31 12:54:18 2016', queued=0ms, exec=0ms
PCSD Status:
node-1.test.domain.local member (10.109.1.4): Offline
node-2.test.domain.local member (10.109.1.3): Offline
node-6.test.domain.local member (10.109.1.2): Offline
There are many failed actions.
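To keep track of them while investigating, the failed actions can be extracted mechanically from the `pcs status` output. A small sketch (the sample file reproduces two lines from the output above; on the cluster the input would be `pcs status` itself):

```shell
# Two failed-action lines copied from the pcs status output above
cat > /tmp/pcs_failed.txt <<'EOF'
* p_mysqld_start_0 on node-1.test.domain.local 'unknown error' (1): call=168, status=Timed Out, exitreason='none',
* p_mysqld_monitor_60000 on node-6.test.domain.local 'unknown error' (1): call=41, status=complete, exitreason='none',
EOF

# Print "operation on node" for each failed action; fields are split
# on spaces and single quotes.
awk -F"[ ']" '/^\*/ {print $2, "on", $4}' /tmp/pcs_failed.txt
```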
We need to investigate what exactly prevents the NTP daemon from running normally.
This is not a QA bug; I've analyzed all of our code related to the problem, and we couldn't fix it on our side.