nailgun-agent hangs when it can not list disks

Bug #1396086 reported by Łukasz Oleś on 2014-11-25
This bug affects 1 person
Affects             Status  Importance  Assigned to              Milestone
Fuel for OpenStack          High        Vladimir Sharshov
5.1.x                       High        Registry Administrators
6.0.x                       High        Registry Administrators
6.1.x                       High        Vladimir Sharshov

Bug Description

When one of the disks is broken, for example (from dmesg):

[12861.715699] Buffer I/O error on device fd0, logical block 0
[12873.965252] end_request: I/O error, dev fd0, sector 0

nailgun-agent hangs, and the node goes into the offline state.

Łukasz Oleś (loles) on 2014-11-25
Changed in fuel:
importance: Undecided → Medium
tags: added: nailgun-agent
Sergii Golovatiuk (sgolovatiuk) wrote :

Due to definitions https://wiki.openstack.org/wiki/Fuel/How_to_contribute this bug as such cases may appear and there is no workaround for this.

Mike Scherbakov (mihgen) wrote :

This is not a Medium; it is certainly higher.
If the node goes offline, we can't really deploy anything on it. I hope the agent, when it starts, checks whether a copy is already running. Otherwise we may end up with hundreds of nailgun-agent processes in Linux before the node gets stuck.

Łukasz Oleś (loles) wrote :

Mike, there will be only one agent. It uses locks, and a new one will not start until the old one finishes.
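
For illustration, a single-instance guard of the kind described above could look roughly like this. It is only a sketch: the lock path and the helper name with_single_instance are hypothetical, and the agent's actual locking code is not shown in this report.

```ruby
require 'tmpdir'

# Hypothetical lock file location, chosen only for this sketch.
LOCK_PATH = File.join(Dir.tmpdir, 'agent-sketch.lock')

# Runs the block only if nobody else holds the lock.
# Returns the block's value, or :already_running if the lock is taken.
def with_single_instance(lock_path = LOCK_PATH)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # LOCK_NB makes flock return false right away instead of blocking.
    return :already_running unless lock.flock(File::LOCK_EX | File::LOCK_NB)

    begin
      yield
    ensure
      # Release the lock even if the block raises.
      lock.flock(File::LOCK_UN)
    end
  end
end
```

With a guard like this, a hung agent keeps holding the lock, so extra copies do not pile up, but no new run can report the node either.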

Changed in fuel:
milestone: none → 6.1
tags: added: module-astute
Changed in fuel:
milestone: 6.1 → 7.0
no longer affects: fuel/7.0.x
Mike Scherbakov (mihgen) wrote :

Vladimir, folks,
why can't we simply surround the code with

    require 'timeout'
    status = Timeout::timeout(5) {
      # Something that should be interrupted if it takes more than 5 seconds...
    }

?

Failure of one disk should not prevent the whole node from being discovered in Nailgun.
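
A minimal sketch of that idea, assuming disk detection can be passed in as a block: if the block does not finish in time, the node is still reported, just with an empty disk list. The helper names disks_with_timeout and collect_node_data and the placeholder hostname are hypothetical and not taken from nailgun-agent.

```ruby
require 'timeout'

# Hypothetical helper: runs the disk-listing block under a time limit.
# Returns [] if the block does not finish within `seconds`.
def disks_with_timeout(seconds = 5)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  warn "disk detection did not finish within #{seconds}s; reporting no disks"
  []
end

# Hypothetical report builder: the rest of the node data is collected
# regardless of whether disk detection succeeded.
def collect_node_data(seconds = 5, &list_disks)
  {
    hostname: 'node-1', # placeholder value for the sketch
    disks: disks_with_timeout(seconds, &list_disks)
  }
end
```

Usage would be something like collect_node_data(5) { detect_disks }, where detect_disks stands in for whatever call actually enumerates the disks. Whether Timeout can interrupt a particular blocking call depends on how that call is implemented, so this would need testing against a real failing device.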

tags: added: qa-agree-7.0 release-notes
Vladimir Sharshov (vsharshov) wrote :

+1 for Mike's solution.

But it has a small side effect: some disks will disappear from the web UI. I think this is a small price to pay.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
assignee: Vladimir Sharshov (vsharshov) → nobody
assignee: nobody → Fuel Astute Team (fuel-astute)

Fix proposed to branch: master
Review: https://review.openstack.org/180100

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Changed in fuel:
milestone: 7.0 → 6.1

Reviewed: https://review.openstack.org/180100
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=689fdeddb8b08cf07300dd554603f65d495559f4
Submitter: Jenkins
Branch: master

commit 689fdeddb8b08cf07300dd554603f65d495559f4
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue May 5 13:35:08 2015 +0300

    Prevent agent hangs if ohai does not return disks

    Instead of freeze we got all data without disks now.
    Current timeout - 30 sec.

    Co-Authored-By: Mike Scherbakov (mihgen) <email address hidden>
    Change-Id: I65d1b570cd01e12b521403c6d6e990043eb2c2ab
    Closes-Bug: #1396086

Changed in fuel:
status: In Progress → Fix Committed
tags: removed: qa-agree-7.0 release-notes
tags: added: on-verification
tags: removed: on-verification