Ci: new ntp-wait command in devstack-gate hangs for 100 minutes which breaks jobs

Bug #1608226 reported by Sagi (Sergey) Shnaidman
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Sagi (Sergey) Shnaidman

Bug Description

After last commit to devstack-gate repo: https://github.com/openstack-infra/devstack-gate/commit/53170ea72a3e86a02a750d90e983be55400d61e2
there is new command ntp-wait that tries to sync with NTP servers and hangs for 6 secs * 1000 tries = 60000 seconds (http://linux.die.net/man/8/ntp-wait)

It completely breaks all jobs because of this huge timeout:
http://logs.openstack.org/63/347463/12/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/1bf9f69/console.html#_2016-07-30_23_32_37_551752

non-ha jobs pass because ntp-wait isn't found there:
sudo: /usr/sbin/ntp-wait: command not found
http://logs.openstack.org/63/347463/11/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/ec6b357/console.html#_2016-07-29_20_47_06_780246

Tags: ci
description: updated
Changed in tripleo:
importance: Undecided → Critical
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
tags: added: alert
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Patch about timeouts in devstack script:
https://review.openstack.org/#/c/349261/

Next we need to fix NTP issues on slaves

Revision history for this message
Paul Belanger (pabelanger) wrote :

This is likely a results of some sort of upstream firewall / network issue with tripleo-test-cloud-rh2.

We don't have this issue in rax-iad:

[pabelanger@centos-7-2-node-rax-iad-3049697 ~]$ ntp-wait -v
Waiting for ntpd to synchronize... OK!

We need a network admin to debug this.

Revision history for this message
Paul Belanger (pabelanger) wrote :
Revision history for this message
James Slagle (james-slagle) wrote :

I booted an instance in rh2 to test ntp.

It does look like some sort of firewall issue. Outbound ntp traffic works, but
syncs fail:

[root@slagle-multinode-ha-2 ~]# ping 0.centos.pool.ntp.org
PING 0.centos.pool.ntp.org (171.66.97.126) 56(84) bytes of data.
64 bytes from srcf-ntp.stanford.edu (171.66.97.126): icmp_seq=1 ttl=53 time=77.4 ms
64 bytes from srcf-ntp.stanford.edu (171.66.97.126): icmp_seq=2 ttl=53 time=77.0 ms
^C
--- 0.centos.pool.ntp.org ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 77.056/77.269/77.483/0.350 ms
[root@slagle-multinode-ha-2 ~]#
[root@slagle-multinode-ha-2 ~]# nc 0.centos.pool.ntp.org 123 -v -u
Ncat: Version 6.40 ( http://nmap.org/ncat )
Ncat: Connected to 171.66.97.126:123.
^C
[root@slagle-multinode-ha-2 ~]#
[root@slagle-multinode-ha-2 ~]# ntpdate -q 0.centos.pool.ntp.org
server 4.53.160.75, stratum 2, offset 0.008555, delay 0.05557
server 96.126.112.93, stratum 0, offset 0.000000, delay 0.00000
server 216.218.254.202, stratum 1, offset 0.007480, delay 0.09669
server 64.79.107.43, stratum 2, offset 0.006054, delay 0.05096
31 Jul 17:14:16 ntpdate[10076]: adjust time server 216.218.254.202 offset 0.007480 sec
[root@slagle-multinode-ha-2 ~]# echo $?
0

An attempt to actually synchronize fails:

[root@slagle-multinode-ha-2 ~]# ntpdate -v 0.centos.pool.ntp.org
31 Jul 17:14:48 ntpdate[10086]: ntpdate 4.2.6p5@1.2349-o Tue May 31 10:28:55 UTC 2016 (1)
31 Jul 17:14:56 ntpdate[10086]: no server suitable for synchronization found
[root@slagle-multinode-ha-2 ~]# echo $?
1

Strangely, adding -d to the above command caused it to work:

[root@slagle-multinode-ha-2 ~]# ntpdate -dv 0.centos.pool.ntp.org
31 Jul 17:16:30 ntpdate[10096]: ntpdate 4.2.6p5@1.2349-o Tue May 31 10:28:55 UTC 2016 (1)
Looking for host 0.centos.pool.ntp.org and service ntp
host found : 4.53.160.75
transmit(4.53.160.75)
receive(4.53.160.75)
transmit(96.126.112.93)
<snip>
31 Jul 17:16:37 ntpdate[10096]: adjust time server 216.218.254.202 offset 0.007999 sec
[root@slagle-multinode-ha-2 ~]# echo $?
0

Which led me to:
https://access.redhat.com/solutions/31861
which indicates -d always use an unprivileged port, so incoming traffic to port
123 is likely blocked.

On the bastion I see port 123 open for tcp, does it need to be open for udp as
well?

Revision history for this message
Paul Belanger (pabelanger) wrote :

yes, ntpd with use udp/123

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

It's not firewall issue, because ntpdate works like a charm with the same UPD 123 ports:

18:23:08.874211 IP (tos 0x0, ttl 64, id 16086, offset 0, flags [DF], proto UDP (17), length 76)
    192.168.102.248.34162 > 129.250.35.250.ntp: [udp sum ok] NTPv4, length 48
        Client, Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 3 (8s), precision -6
        Root Delay: 1.000000, Root dispersion: 1.000000, Reference-ID: (unspec)
          Reference Timestamp: 0.000000000
          Originator Timestamp: 0.000000000
          Receive Timestamp: 0.000000000
          Transmit Timestamp: 3678978188.874171018 (2016/07/31 18:23:08)
            Originator - Receive Timestamp: 0.000000000
            Originator - Transmit Timestamp: 3678978188.874171018 (2016/07/31 18:23:08)
18:23:08.883065 IP (tos 0x0, ttl 55, id 0, offset 0, flags [DF], proto UDP (17), length 76)
    129.250.35.250.ntp > 192.168.102.248.34162: [udp sum ok] NTPv4, length 48
        Server, Leap indicator: (0), Stratum 2 (secondary reference), poll 3 (8s), precision -20
        Root Delay: 0.055389, Root dispersion: 0.002609, Reference-ID: 249.224.99.213
          Reference Timestamp: 3678977631.526893913 (2016/07/31 18:13:51)
          Originator Timestamp: 3678978188.874171018 (2016/07/31 18:23:08)
          Receive Timestamp: 3678978188.829251885 (2016/07/31 18:23:08)
          Transmit Timestamp: 3678978188.829276561 (2016/07/31 18:23:08)
            Originator - Receive Timestamp: -0.044919114
            Originator - Transmit Timestamp: -0.044894441

Another difference between ntpdate and ntpq - ntpdate sends packets with NTPv4, but ntpq does with NTPv2.

18:25:12.049055 IP (tos 0x0, ttl 64, id 24057, offset 0, flags [DF], proto UDP (17), length 40)
    192.168.102.248.45063 > 132.163.4.101.ntp: [udp sum ok] NTPv2, length 12
        Reserved, Leap indicator: (0), Stratum 1 (primary reference), poll 0 (1s), precision 1
        Root Delay: 0.000000, Root dispersion: 0.000000 [|ntp]
18:25:17.054063 IP (tos 0x0, ttl 64, id 24058, offset 0, flags [DF], proto UDP (17), length 40)
    192.168.102.248.45063 > 132.163.4.101.ntp: [udp sum ok] NTPv2, length 12
        Reserved, Leap indicator: (0), Stratum 1 (primary reference), poll 0 (1s), precision 1
        Root Delay: 0.000000, Root dispersion: 0.000000 [|ntp]

However switching ntpq to version 4 didn't help:

18:31:01.074943 IP (tos 0x0, ttl 64, id 13471, offset 0, flags [DF], proto UDP (17), length 40)
    192.168.102.248.33748 > 171.66.97.126.ntp: [udp sum ok] NTPv4, length 12
        Reserved, Leap indicator: (0), Stratum 2 (secondary reference), poll 0 (1s), precision 6
        Root Delay: 0.000000, Root dispersion: 0.000000 [|ntp]
18:31:06.080589 IP (tos 0x0, ttl 64, id 13472, offset 0, flags [DF], proto UDP (17), length 40)
    192.168.102.248.33748 > 171.66.97.126.ntp: [udp sum ok] NTPv4, length 12
        Reserved, Leap indicator: (0), Stratum 2 (secondary reference), poll 0 (1s), precision 6
        Root Delay: 0.000000, Root dispersion: 0.000000 [|ntp]

Revision history for this message
Paul Belanger (pabelanger) wrote :

ntpdate != ntpd

For ntpd to work, the upstream search establishes an incoming udp/123 then the exchange happens. ntpdate just established an outbound connection.

The following should be setup on the bastion host:

# iptables -A OUTPUT -p udp --dport 123 -j ACCEPT
# iptables -A INPUT -p udp --sport 123 -j ACCEPT

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Unfortunately I think I don't have access to see rules there. On slave host iptables --flush didn't help.

tags: removed: alert
Revision history for this message
Ian Wienand (iwienand) wrote :

Note that I saw a smaller timeout with and filed [1]. I'm trying to understand why restarting the ntp daemon makes it work.

Note the other suggestions in the bug was to use ntpdate to bring things into sync, or switch to chrony. it's not exactly clear why ntpdate is deprecated? [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1361382
[2] https://review.openstack.org/#/c/299677/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Ian, ntpdate functionality is replaced with ntpd so it's deprecated for not to duplicate tools as I understand. (It's funny that the feature which would help us to workaround this issue in fact is not supported in ntpd - sending packets from unpriviledged ports.)

It really seems like firewall issue, I tried with ntpdate:

-bash-4.2# ntpdate 1.centos.pool.ntp.org
 1 Aug 08:25:46 ntpdate[5359]: no server suitable for synchronization found

-bash-4.2# ntpdate -u 1.centos.pool.ntp.org
 1 Aug 08:25:59 ntpdate[5360]: step time server 107.21.100.177 offset -1.063075 sec

when using -u it sends packets from arbitrary UDP port (above 1024), and it can receive answers. When using without -u it sends packets from 123 port, exactly as ntpd does, and then it can't receive the answer. The packets don't reach destination if they go out from unprivileged ports.

Revision history for this message
James Slagle (james-slagle) wrote :

since the public centos ntp servers seem to be completely firewalled off in rh2, i added the following to /etc/neutron/dnsmasq-neutron.conf on overcloud-controller-0:

dhcp-option-force=42,<bastion ip>

which will set the bastion IP as the ntp server for launched instances on the overcloud.

that worked in my local testing, but does not seem to be working in CI itself. Here's an example job:
http://logs.openstack.org/54/341354/18/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/ddd54a6/

on the undercloud in /etc/ntp.conf, the server is added, but we still get the 10 minute wait for ntp during the run:

 155 2016-08-01 14:44:54.267399 | + ./safe-devstack-vm-gate-wrap.sh
 156 2016-08-01 14:54:55.020345 | Waiting for ntpd to synchronize... ^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^ H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^H*^H+^H:^H.^ HNo!
 157 2016-08-01 14:54:55.020470 | ntpd did not synchronize.

Revision history for this message
Paul Belanger (pabelanger) wrote :

We recently made changes to openstack-infra to stop allowing DHCP to setup domain / search, I believe having DHCP setup NTP server is also not correct.

I think it is a good test to confirm this something upstream of tripleo-test-cloud-rh2 is firewalling NTP, but we need to address that network issue, not add workarounds for it.

Both tripleo-test-cloud-rh(1|2) should have direct pipe to the internet, which not filtering of any kind. We should be treating triple-test-clouds as public clouds, not private behind some sort of appliance. I understand that will be more work, but if we want tripleo-test-cloud-rh(1|2) to continue using nodepool.openstack.org, we should work like other public clouds.

Revision history for this message
Paul Belanger (pabelanger) wrote :

Okay, I just brought tripleo-test-cloud-rh1 backonline, and ntp-wait works as expected. However I had to restart ntpd like ianw suggested:

\Aug 1 19:30:15 localhost ntpd[332]: ntpd 4.2.6p5@1.2349-o Tue May 31 10:28:54 UTC 2016 (1)
Aug 1 19:30:16 localhost ntpd[338]: proto: precision = 0.142 usec
Aug 1 19:30:16 localhost ntpd[338]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Aug 1 19:30:16 localhost ntpd[338]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Aug 1 19:30:16 localhost ntpd[338]: Listen and drop on 1 v6wildcard :: UDP 123
Aug 1 19:30:16 localhost ntpd[338]: Listen normally on 2 lo 127.0.0.1 UDP 123
Aug 1 19:30:16 localhost ntpd[338]: Listen normally on 3 lo ::1 UDP 123
Aug 1 19:30:16 localhost ntpd[338]: Listening on routing socket on fd #20 for interface updates
Aug 1 19:30:16 localhost ntpd[338]: Deferring DNS for 0.centos.pool.ntp.org 1
Aug 1 19:30:16 localhost ntpd[338]: Deferring DNS for 1.centos.pool.ntp.org 1
Aug 1 19:30:16 localhost ntpd[338]: Deferring DNS for 2.centos.pool.ntp.org 1
Aug 1 19:30:16 localhost ntpd[338]: 0.0.0.0 c016 06 restart
Aug 1 19:30:16 localhost ntpd[338]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
Aug 1 19:30:16 localhost ntpd[338]: 0.0.0.0 c011 01 freq_not_set
Aug 1 19:30:18 localhost ntpd_intres[349]: host name not found: 0.centos.pool.ntp.org
Aug 1 19:30:18 localhost ntpd_intres[349]: host name not found: 1.centos.pool.ntp.org
Aug 1 19:30:18 localhost ntpd_intres[349]: host name not found: 2.centos.pool.ntp.org
Aug 1 19:30:21 localhost ntpd[338]: Listen normally on 4 eth0 192.168.101.92 UDP 123
Aug 1 19:30:21 localhost ntpd[338]: Listen normally on 5 eth0 fe80::f816:3eff:fea4:ac73 UDP 123
Aug 1 19:30:23 localhost ntpd_intres[349]: host name not found: 0.centos.pool.ntp.org
Aug 1 19:30:23 localhost ntpd_intres[349]: host name not found: 1.centos.pool.ntp.org
Aug 1 19:30:23 localhost ntpd_intres[349]: host name not found: 2.centos.pool.ntp.org
Aug 1 19:32:25 localhost ntpd_intres[349]: DNS 0.centos.pool.ntp.org -> 104.232.3.3
Aug 1 19:32:25 localhost ntpd_intres[349]: DNS 1.centos.pool.ntp.org -> 4.53.160.75
Aug 1 19:32:25 localhost ntpd_intres[349]: DNS 2.centos.pool.ntp.org -> 66.135.44.92

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We have a workaround already in https://review.openstack.org/#/c/349261/
If it disappears when moving to rh1, then it's problem of environment and I'll close this bug.

Changed in tripleo:
status: New → Triaged
Changed in tripleo:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.