nailgun-agent hangs when it can not list disks

Bug #1396086 reported by Łukasz Oleś
This bug affects 1 person
Affects             Status         Importance  Assigned to              Milestone
Fuel for OpenStack  Fix Committed  High        Vladimir Sharshov
5.1.x               Won't Fix      High        Registry Administrators
6.0.x               Won't Fix      High        Registry Administrators
6.1.x               Fix Committed  High        Vladimir Sharshov

Bug Description

When one of the disks is broken, for example (from dmesg):

[12861.715699] Buffer I/O error on device fd0, logical block 0
[12873.965252] end_request: I/O error, dev fd0, sector 0

nailgun-agent hangs and the node goes offline.

Łukasz Oleś (loles)
Changed in fuel:
importance: Undecided → Medium
tags: added: nailgun-agent
Sergii Golovatiuk (sgolovatiuk) wrote :

According to the definitions at https://wiki.openstack.org/wiki/Fuel/How_to_contribute, the importance of this bug follows from the fact that such cases may occur and there is no workaround.

Mike Scherbakov (mihgen) wrote :

This is not a Medium, certainly higher.
If the node goes offline, we can't really deploy anything on it. I hope the agent, when it starts, checks whether a copy is already running. Otherwise we may end up with hundreds of nailgun-agents running in Linux before the node gets stuck.

Łukasz Oleś (loles) wrote :

Mike, there will be only one agent. It uses locks, and a new one will not start until the old one finishes.
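
For illustration, single-instance locking of this kind is commonly done in Ruby with a non-blocking `flock`. This is a minimal sketch, not the agent's actual code: the lock path and the `run_exclusively` helper are hypothetical, and the report does not say which locking mechanism the agent uses.

```ruby
require 'tmpdir'

# Hypothetical lock path; the real agent's lock file location is not given here.
LOCK_PATH = File.join(Dir.tmpdir, 'agent-example.lock')

# Runs the block only if nobody else holds the lock.
# Returns the block's value, or :already_running when the lock is taken.
def run_exclusively(lock_path)
  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # LOCK_NB makes flock return false at once instead of waiting.
    return :already_running unless lock.flock(File::LOCK_EX | File::LOCK_NB)

    yield
  end
end
```

With this pattern a second copy exits immediately while the first is still running, which also means a hung first copy blocks every later run.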

Changed in fuel:
milestone: none → 6.1
tags: added: module-astute
Changed in fuel:
milestone: 6.1 → 7.0
no longer affects: fuel/7.0.x
Mike Scherbakov (mihgen) wrote :

Vladimir, folks,
why can't we simply surround the code with
require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes more than 5 seconds...
}
?

Failure of one disk should not prevent the whole node from being discovered in Nailgun.
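
A minimal sketch of that idea, assuming the disk listing can be passed in as a block: if the probe does not finish in time, fall back to an empty disk list so the rest of the node data can still be reported. The `disks_or_empty` helper, the sample probes, and the report fields are hypothetical stand-ins, not the agent's real code.

```ruby
require 'timeout'

# Hypothetical wrapper: runs the given disk probe, giving up after `seconds`.
# Returns the probe's result, or an empty list when the probe times out.
def disks_or_empty(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  []
end

# Stand-ins for a healthy probe and a stuck one.
healthy = disks_or_empty(1) { [{ name: 'sda' }] }
stuck   = disks_or_empty(0.1) { sleep 5; [{ name: 'fd0' }] }

# The node is still reported, just without disk data.
report = { hostname: 'node-1', disks: stuck }
```

Whether `Timeout` can actually interrupt the probe depends on how the probe blocks (a call stuck inside the kernel may not return promptly), so this should be checked against a real failing disk.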

tags: added: qa-agree-7.0 release-notes
Vladimir Sharshov (vsharshov) wrote :

+1 for Mike's solution.

There is one small side effect: some disks will disappear from the web UI. I think this is a small price to pay.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
assignee: Vladimir Sharshov (vsharshov) → nobody
assignee: nobody → Fuel Astute Team (fuel-astute)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/180100

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Changed in fuel:
milestone: 7.0 → 6.1
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/180100
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=689fdeddb8b08cf07300dd554603f65d495559f4
Submitter: Jenkins
Branch: master

commit 689fdeddb8b08cf07300dd554603f65d495559f4
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue May 5 13:35:08 2015 +0300

    Prevent agent hangs if ohai does not return disks

    Instead of freeze we got all data without disks now.
    Current timeout - 30 sec.

    Co-Authored-By: Mike Scherbakov (mihgen) <email address hidden>
    Change-Id: I65d1b570cd01e12b521403c6d6e990043eb2c2ab
    Closes-Bug: #1396086

Changed in fuel:
status: In Progress → Fix Committed
tags: removed: qa-agree-7.0 release-notes
tags: added: on-verification
tags: removed: on-verification