Fuel-nailgun-agent execution expired

Bug #1665584 reported by Aleksei Chekunov
This bug affects 3 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Confirmed      Medium      Fuel Sustaining
Mitaka              Fix Committed  High        Oleksiy Molchanov
Newton              Confirmed      Medium      Fuel Sustaining

Bug Description

MOS 9.2, with controller nodes running on VMs.
After a successful deployment, the Fuel UI shows that some nodes have repeatedly gone away and come back online.
Today
09:19:29 Node 'controller' is back online
09:19:13 Node 'controller' has gone away
09:14:09 Node 'controller3' is back online
09:13:13 Node 'controller' is back online
09:13:13 Node 'controller3' has gone away
09:12:43 Node 'controller' has gone away
09:11:32 Node 'controller2' is back online
09:10:13 Node 'controller2' has gone away
09:09:13 Node 'controller' is back online
09:09:12 Node 'controller' has gone away
08:58:30 Node 'controller3' is back online
08:58:24 Node 'controller2' is back online
08:58:12 Node 'controller3' has gone away
08:58:12 Node 'controller2' has gone away
08:57:27 Node 'controller' is back online
08:57:11 Node 'controller' has gone away
08:49:13 Node 'controller3' is back online
08:49:11 Node 'controller3' has gone away
08:43:23 Node 'controller3' is back online
08:43:17 Node 'controller2' is back online
08:43:10 Node 'controller3' has gone away
08:42:40 Node 'controller2' has gone away
08:42:18 Node 'controller' is back online
08:41:40 Node 'controller' has gone away
08:33:48 Node 'controller2' is back online
08:33:40 Node 'controller2' has gone away
08:29:45 Node 'controller' is back online

.....

in /var/log/nailgun-agent.log:

E, [2017-02-17T08:54:10.872327 #1351] ERROR -- : Error 'execution expired' in gathering disks metadata: ["/usr/bin/nailgun-agent:812:in ``'", "/usr/bin/nailgun-agent:812:in `_multipath_devices'", "/usr/bin/nailgun-agent:748:in `block in _detailed'", "/usr/bin/nailgun-agent:737:in `_detailed'", "/usr/bin/nailgun-agent:1212:in `_data'", "/usr/bin/nailgun-agent:210:in `put'", "/usr/bin/nailgun-agent:1419:in `<main>'"]
I, [2017-02-17T08:54:11.766303 #1351] INFO -- : Wrote data to file '/etc/nailgun_uid'. Data: 1
at depth 0 - 18: self signed certificate
I, [2017-02-17T08:56:24.299347 #15936] INFO -- : API URL is https://10.30.0.2:8443/api
at depth 0 - 18: self signed certificate
E, [2017-02-17T08:57:27.295680 #15936] ERROR -- : Error 'execution expired' in gathering disks metadata: ["/usr/bin/nailgun-agent:812:in ``'", "/usr/bin/nailgun-agent:812:in `_multipath_devices'", "/usr/bin/nailgun-agent:748:in `block in _detailed'", "/usr/bin/nailgun-agent:737:in `_detailed'", "/usr/bin/nailgun-agent:1212:in `_data'", "/usr/bin/nailgun-agent:210:in `put'", "/usr/bin/nailgun-agent:1419:in `<main>'"]
I, [2017-02-17T08:57:27.430535 #15936] INFO -- : Wrote data to file '/etc/nailgun_uid'. Data: 1
at depth 0 - 18: self signed certificate
I, [2017-02-17T08:59:10.749276 #30134] INFO -- : API URL is https://10.30.0.2:8443/api
at depth 0 - 18: self signed certificate
..........
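
For readers unfamiliar with the message: 'execution expired' is the default message of Ruby's Timeout::Error, so the log lines above are consistent with an external command being interrupted by a Timeout.timeout wrapper. The sketch below only illustrates that mechanism; gather_with_limit is a hypothetical helper, not code from nailgun-agent, and sleep stands in for the slow command.

```ruby
require 'timeout'

# Hypothetical helper (not from nailgun-agent): runs the given block under
# a time limit and returns either the block's result or an error string
# shaped like the one in the agent log.
def gather_with_limit(limit_sec)
  Timeout.timeout(limit_sec) { yield }
rescue Timeout::Error => e
  # Timeout::Error carries the default message 'execution expired'.
  "Error '#{e.message}' in gathering disks metadata"
end

# A block that outlives the limit is interrupted.
puts gather_with_limit(0.05) { sleep 1; :done }
# A block that finishes in time returns its own value.
puts gather_with_limit(1) { :done }
```
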

affects: designate → fuel
Changed in fuel:
milestone: none → 10.1
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Roman Rufanov (rrufanov)
tags: added: customer-found support
Changed in fuel:
milestone: 10.1 → 11.0
tags: added: area-python
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Alexey Stupnikov (astupnikov)
Alexey Stupnikov (astupnikov) wrote :

I used strace to find the reason for the 'udevadm settle' timeouts. It turns out to be a known udev issue described in [1], and the kernel developers are fine with it. Moreover, even if they do fix it, their patch will not be backportable for us, so there is no way to fix this issue directly. I think the best workaround (WA) is to call 'udevadm settle' with a reasonable timeout, say 15 seconds. That would still fix the original bug with mpath devices not being ready at agent startup, while letting us keep the agent's current timeouts without flapping.

[1] https://lists.gt.net/linux/kernel/1524376
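
A minimal sketch of the workaround proposed above, assuming the agent shells out to udevadm. The helper names udev_settle_command and settle_udev and the injectable runner are hypothetical and are not taken from the proposed review. The --timeout option itself is a documented 'udevadm settle' flag.

```ruby
# Hypothetical helper: builds the argv for a bounded 'udevadm settle' call.
def udev_settle_command(timeout_sec = 15)
  ['udevadm', 'settle', "--timeout=#{Integer(timeout_sec)}"]
end

# Hypothetical helper: runs the command and returns true only when udevadm
# exits with status 0. Kernel#system returns false on a non-zero exit and
# nil when the command cannot be executed, so both map to false here.
# The runner is injectable so the logic can be exercised without udevadm.
def settle_udev(timeout_sec = 15, runner: ->(cmd) { system(*cmd) })
  runner.call(udev_settle_command(timeout_sec)) == true
end
```

With a 15-second bound the call returns well within the agent's own timeouts, at the cost of possibly continuing before the udev event queue is empty.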

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/444310

Changed in fuel:
status: Confirmed → In Progress
Alexey Stupnikov (astupnikov) wrote :

As per azvyagintsev's comment in review [1], the proposed timeout may not be long enough for the OS to initialize mpath devices, in which case it would break a previous fix [2]. I think the proposed change [1] is still a good workaround for the reported issue, but a more solid solution is needed to close the bug properly. I am assigning the bug to the fuel-sustaining group and will wait for a fix to backport.

[1] https://review.openstack.org/#/c/444310/2
[2] https://review.openstack.org/#/c/285340/

Changed in fuel:
assignee: Alexey Stupnikov (astupnikov) → Fuel Sustaining (fuel-sustaining-team)
status: In Progress → Confirmed
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (master)

Change abandoned by Alexey Stupnikov (<email address hidden>) on branch: master
Review: https://review.openstack.org/444310
Reason: Can break mpath device initialization.

Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

tags: added: sla1
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/fuel-nailgun-agent (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Oleksiy Molchanov <email address hidden>
Review: https://review.fuel-infra.org/38899

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/fuel-nailgun-agent (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/38899
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: dd5fc8ed168ab7714e8ca1bfb6cc152c448f6704
Author: Oleksiy Molchanov <email address hidden>
Date: Tue Jul 17 12:34:38 2018

Udevadm settle loops

There is an issue with 'udevadm settle' command described at [1]. It
turns out that inconsistent task IDs stored in queue.bin and
uevent_seqnum could cause udevadm settle to loop trying to find
the task that doesn't exist. Kernel developers are skeptical about
this issue and not willing to fix it, but even if they will, we
still will take it to MOS.

By default, it uses a timeout of 180 seconds that is inconsistent
with a set of existing nailgun-agent's timeouts. We can remove this
check, absence of it shouldn't break anything.

[1] https://lists.gt.net/linux/kernel/1524376

Change-Id: I0e71786c3516463496a313ea17438582a45ad2aa
Closes-bug: #1665584
