[BVT] Deployment failed with Failed to execute hook 'sync_time' command: cd / && ntpdate -u

Bug #1533082 reported by Tatyanka
24
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
In Progress
High
Stanislaw Bogatkin
8.0.x
Confirmed
Critical
Stanislaw Bogatkin
Mitaka
In Progress
High
Stanislaw Bogatkin

Bug Description

https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.ubuntu.bvt_2/55/

Deploy ceph HA with RadosGW for objects
Scenario:
1. Create cluster with Neutron
2. Add 3 nodes with controller role
3. Add 3 nodes with compute and ceph-osd role
4. Deploy the cluster
5. Check ceph status
6. Run OSTF tests
7. Check the radosqw daemon is started
Duration 90m
Snapshot ceph_rados_gw

Actual result:
Deployment failed with
Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. Method granular_deploy. Failed to execute hook 'sync_time' command: cd / && ntpdate -u $(awk '/^server/ { if ($2 !~ /127\.127\.[0-9]+\.[0-9]+/) {ORS=" "; print $2}}' /etc/ntp.conf)

Task: ccb570a8-8835-41a8-bc24-614b80368fea: shell timeout error: execution expired
Task timeout: 180, Retries: 10

---
uids:
- '1'
- '3'
- '2'
- '5'
- '4'
- '6'
parameters:
  retries: 10
  cmd: ntpdate -u $(awk '/^server/ { if ($2 !~ /127\.127\.[0-9]+\.[0-9]+/) {ORS="
    "; print $2}}' /etc/ntp.conf)
  cwd: /
  timeout: 180
  interval: 30
priority: 700
fail_on_error: true
type: shell
id: sync_time
.
Inspect Astute logs for the details'

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: New → Confirmed
Stanislav Makar (smakar)
description: updated
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
description: updated
Changed in fuel:
assignee: Stanislav Makar (smakar) → Stanislaw Bogatkin (sbogatkin)
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

I reverted 2 environments with this problem - and they sync okay just after revert. So, problem looks like intermittent problem with pustream NTP server. Need some time for reproducing this on local env and on CI envs again.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Next environment sync time right - it works as usual, so I suppose that it was temporarily problem with hypervisor host NTP server or problem with upstream servers it try to sync with. Network problems on master node also could prevent NTP sync. Anyway - I set it to incomplete, cause it doesn't reproduce.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

The bug looks pretty close to https://bugs.launchpad.net/fuel/+bug/1504493

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Ok, let's wait for the new results https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.ubuntu.bvt_2/56/ and mark this issue as Fix Released if this temporary issues will be fixed.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The latest execution of the issue passed:
https://product-ci.infra.mirantis.net/view/9.0-liberty/job/9.0-liberty.ubuntu.bvt_2/

status changed to Fix Released because it was probably some temporary issue which is already fixed (accordingly to comment #3).

Changed in fuel:
status: Incomplete → Fix Released
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

nothing was fixed, status changed to invalid

Changed in fuel:
status: Fix Released → Invalid
milestone: 9.0 → 8.0
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Reproduced again during swarm tests run https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.cluster_actions/105/testReport/(root)/deploy_neutron_stop_reset_on_provisioning/deploy_neutron_stop_reset_on_provisioning/

Looks like we have some network issues during this tasks execution, but at the same time after revert of environment task passed, so may be we need to add some retries here?

Changed in fuel:
status: Invalid → Confirmed
importance: Critical → High
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Sorry, I still cannot reproduce this bug. I suppose that it can be problem with upstream server master node tries to sync or it could be some network issues when network doesn't work for long time. We can add some timeouts to sync_time task, but there is enough time right now - 300s.

Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → Fuel Library Team (fuel-library)
status: Confirmed → Incomplete
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Dmitry Kaigarodеsev (dkaiharodsev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/269790

tags: added: swarm-blocker
tags: added: team-bugfix
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :
Revision history for this message
Vitalii Gridnev (vgridnev) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/269790
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=78b94ef6af4a46d163521307da741567bbe83552
Submitter: Jenkins
Branch: master

commit 78b94ef6af4a46d163521307da741567bbe83552
Author: Stanislaw Bogatkin <email address hidden>
Date: Tue Jan 19 20:53:29 2016 +0300

    Add verbosity to sync_time task

    Give task which sync time some verbosity and move output to syslog.

    Change-Id: I7bd365ecccfdd76c09ed61dfcd93ec5de2f3c0ae
    Related-Bug: #1533082

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/270757

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

If we will look on https://ci.fuel-infra.org/job/master.fuel-library.pkgs.ubuntu.neutron_vlan_ha/5191/consoleFull, we see a problem that ntpdate on node-2 was ran, but didn't return any result. It looks as hang, so patch with timeout was proposed. I set workflow '-1' to it for proper testing. When it will be tested - we decide if it can help and should be merged or it should be abandoned.

Revision history for this message
Alexey Galkin (agalkin) wrote :

Reproduced on mos-8.0 iso: #465

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Review: https://review.openstack.org/269790 absolutely useless because ntpdate started just before syslog write usefull information to the logfile:

[root@fuel ~]# grep -ri ntpdate /var/log/docker-logs/remote/
[root@fuel ~]#

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Michael - you are right. But there are no alternative to ntpdate. Its functionality didn't move to ntpd (and not moved today yet)

Maksim - you are wrong, cause '-r' flag for grep try to find _IN_ file content not considering file name. And syslog writes ntpdate log w/o application name - it's why you can't find the log such way.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Maksim - try to use find /var/log/docker-logs/remote -type f -iname '*ntpdate*' - it will show you where those logs are located.

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :

Guys, did you try check mcollective/rabbit?
i cough same issue in isolated labs, and :
1) change granular tak to something like
...
cmd: echo zz > /tm/test ; ntpdate -u.... ; cmd: echo xx > /tm/test ;
...

2) re-run deployment
And found that issue
1) NOT reproduced with manual re-run of task (after deployment file(so we need full-reset ))
2) No test file exists at all, no failed nodes

Thats mean, that task not ran on "some" nodes at all - since Astute told that task failed on all nodes.
I suppose issue related to "some node" unavailability for mco in task-start time.

I propose dig in mcollective way, and maybe add some self-check like 'mco ping' before start granular task.

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/mitaka
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Good point, thank you @Michael. This enhancement shall be done to the astute shell agent as well

Revision history for this message
Aleksey Zvyagintsev (azvyagintsev) wrote :
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The latest BVT job for MOS 8.0 #484 failed because of the issue

https://product-ci.infra.mirantis.net/job/8.0.ubuntu.bvt_2/436/testReport/(root)/ceph_rados_gw/ceph_rados_gw/

AssertionError: Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. Method granular_deploy. Failed to execute hook 'sync_time' command: cd / && ntpdate -u $(awk '/^server/ { if ($2 !~ /127\.127\.[0-9]+\.[0-9]+/) {ORS=" "; print $2}}' /etc/ntp.conf)

Task: 7127cd48-ce79-4483-a4c9-0d898b698b26: shell timeout error: execution expired
Task timeout: 180, Retries: 10

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Comment from Sergii Golovatiuk:

We are running custom BVT with strace [1] to catch [2]

[1] https://review.openstack.org/#/c/271219/3
[2] https://bugs.launchpad.net/fuel/+bug/1533082

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Guys, about 9.0. We will move to new task based deployment which run such tasks using puppet. Astute will decorate shell call by puppet manifests and all logs you will find in puppet logs. So we do not need to waste time on new shell mcollective agent.
In case of 8.0, on current release state we could not update agent, because it too big change.

I think we should check this bug again when task based will be enable as default for 9.0

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

@mpolenchuk, thank you for link about new shell agent. I will try to check it, it will be second variant in case if puppet method does not work.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Unfortunately, there is no evidence from mco/puppet side about this problem. If we try to run this task with '-d' flag, it shows that ntpdate successfully ran, send one package, then just hangs:

^_^ -> less node-4.test.domain.local/ntpdate.log
2016-01-25T18:45:57.616378+00:00 notice: transmit(10.109.16.2)
2016-01-25T18:45:57.616450+00:00 notice: receive(10.109.16.2)
node-4.test.domain.local/ntpdate.log (END)

At this exact moment each other node ran ntpdate succesfully:
^_^ -> less node-5.test.domain.local/ntpdate.log
2016-01-25T18:45:55.613334+00:00 notice: transmit(10.109.16.2)
2016-01-25T18:45:55.613894+00:00 notice: receive(10.109.16.2)
2016-01-25T18:45:57.613337+00:00 notice: transmit(10.109.16.2)
2016-01-25T18:45:57.613533+00:00 notice: receive(10.109.16.2)
2016-01-25T18:45:59.613488+00:00 notice: transmit(10.109.16.2)
2016-01-25T18:45:59.613488+00:00 notice: adjust time server 10.109.16.2 offset -0.000337 sec
2016-01-25T18:45:59.613499+00:00 notice: receive(10.109.16.2)
2016-01-25T18:45:59.613601+00:00 notice: server 10.109.16.2, port 123
2016-01-25T18:45:59.613693+00:00 notice: stratum 4, precision -24, leap 00, trust 000
2016-01-25T18:45:59.613790+00:00 notice: refid [10.109.16.2], delay 0.02594, dispersion 0.00012
2016-01-25T18:45:59.613874+00:00 notice: transmitted 4, in filter 4
2016-01-25T18:45:59.613979+00:00 notice: reference time: da50eba7.9627817c Mon, Jan 25 2016 18:44:55.586
2016-01-25T18:45:59.614090+00:00 notice: originate timestamp: da50ebe7.9cea9f10 Mon, Jan 25 2016 18:45:59.612
2016-01-25T18:45:59.614184+00:00 notice: transmit timestamp: da50ebe7.9cfc135d Mon, Jan 25 2016 18:45:59.613
2016-01-25T18:45:59.614268+00:00 notice: filter delay: 0.02594 0.02623 0.02611 0.02597
2016-01-25T18:45:59.614360+00:00 notice: 0.00000 0.00000 0.00000 0.00000
2016-01-25T18:45:59.614467+00:00 notice: filter offset: -0.00033 -0.00048 -0.00050 -0.00049
2016-01-25T18:45:59.614795+00:00 notice: 0.000000 0.000000 0.000000 0.000000
2016-01-25T18:45:59.614795+00:00 notice: delay 0.02594, dispersion 0.00012
2016-01-25T18:45:59.614801+00:00 notice: offset -0.000337
2016-01-25T18:45:59.614872+00:00 notice:
2016-01-25T18:45:59.615103+00:00 notice: 25 Jan 18:45:59 ntpdate[8103]: adjust time server 10.109.16.2 offset -0.000337 sec
2016-01-25T18:45:59.616662+00:00 notice: ntpdate 4.2.6p5@1.2349-o Thu Jan 21 23:29:28 UTC 2016 (1)
2016-01-25T18:46:05.718935+00:00 notice: adjust time server 10.109.16.2 offset -0.000435 sec

I trying to reproduce this task with strace patch, but with no luck yet. As we know that problem is in ntpdate itself and not at library side, I kindly ask mos-linux team take a look into this to shed some light.

Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → MOS Linux (mos-linux)
Revision history for this message
Aleksander Mogylchenko (amogylchenko) wrote :

This could have happened because of two reasons:
1. Bug in ntpdate itself
2. Network problems (given the fact that we had a lot of failures recently, this is actually more probable).

But in any case ntpdate is outdated and should not be used:
http://support.ntp.org/bin/view/Dev/DeprecatingNtpdate

Please migrate the code to sntp and continue debugging there.

Changed in fuel:
assignee: MOS Linux (mos-linux) → Stanislaw Bogatkin (sbogatkin)
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Attached with strace from failed bvt.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/273515

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Stanislaw Bogatkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/271219

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/270757
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=6b477d5ea74e788e21cc4d07bdd95b6d564fb32b
Submitter: Jenkins
Branch: master

commit 6b477d5ea74e788e21cc4d07bdd95b6d564fb32b
Author: Stanislaw Bogatkin <email address hidden>
Date: Thu Jan 21 15:02:43 2016 +0300

    Wrap ntpdate to timeout

    As have some problems with retry hanging ntpdate outside,
    wrap it to timeout with preserve status to better exit code
    manage.

    Change-Id: If2967106ef7267a454ca012a444eeda9b7c05793
    Closes-Bug: #1533082

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/273634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/273634
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d24aa1ae0207c7a3d9ba6c297acaaa4bcc54e5f0
Submitter: Jenkins
Branch: master

commit d24aa1ae0207c7a3d9ba6c297acaaa4bcc54e5f0
Author: Vladimir Kuklin <email address hidden>
Date: Thu Jan 28 18:56:25 2016 +0300

    Wrap ntpdate to better timeout

    Change-Id: Ia478c28695cba83c62c9091a5788ca74e1c4534e
    Partial-Bug: #1533082

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/273973

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Vova it is really critical issue that happens everywhere, so please add the comment why you mark it duplicate, please. Until this explanation leave it as confirmed

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

RCA shows that there were a restart of mcollective which is not handled by astute.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Tatiana, I'll give more broad explanation here:

1. Slave node restart after provisioning, mcollective starts.
2. Astute send command to mcollective.
3. Mcollective run a task
4. Cloud-init starts. It had 'service mcollective restart' in code. <- this fixed now
5. Mcollective restarts and can't handle ntpdate exit code. <- this will be fixed in 9.0 later
6. Astute get nothing and terminate env deploy.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Cool, Thank you Stanislaw and Serge, for now it is very clear for everyone. :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/273973
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8e14187f4bf517fad2e96a34e377951543d700c8
Submitter: Jenkins
Branch: stable/8.0

commit 8e14187f4bf517fad2e96a34e377951543d700c8
Author: Vladimir Kuklin <email address hidden>
Date: Thu Jan 28 18:56:25 2016 +0300

    Wrap ntpdate to better timeout

    Change-Id: Ia478c28695cba83c62c9091a5788ca74e1c4534e
    Partial-Bug: #1533082
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/8.0)

Change abandoned by Stanislaw Bogatkin (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/273515

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.