nailgun-agent hangs when it can not list disks

Bug #1396086 reported by Łukasz Oleś on 2014-11-25
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Vladimir Sharshov
Registry Administrators
Registry Administrators
Vladimir Sharshov

Bug Description

When one of the disks is broken, for example (from dmesg):

[12861.715699] Buffer I/O error on device fd0, logical block 0
[12873.965252] end_request: I/O error, dev fd0, sector 0

nailgun-agent hangs, and the node goes into the offline state.

Łukasz Oleś (loles) on 2014-11-25
Changed in fuel:
importance: Undecided → Medium
tags: added: nailgun-agent
Sergii Golovatiuk (sgolovatiuk) wrote :

Due to definitions this bug as such cases may appear and there is no workaround for this.

Mike Scherbakov (mihgen) wrote :

This is not a Medium, certainly higher.
If the node goes offline, we can't really deploy anything on it. I hope the agent, when it starts, checks whether a copy is already running. Otherwise we may end up with hundreds of nailgun-agents in Linux before the node gets stuck.

Łukasz Oleś (loles) wrote :

Mike, there will be only one agent. It uses locks, and a new one will not start until the old one finishes.
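
The thread does not say how the agent takes its lock. As a minimal sketch of the single-instance behaviour described above, assuming a plain advisory file lock via File#flock (the lock path and the helper name with_single_instance are hypothetical, not taken from nailgun-agent):

```ruby
require 'tmpdir'

# Hypothetical lock file location, chosen only for this sketch.
LOCK_PATH = File.join(Dir.tmpdir, 'agent-sketch.lock')

# Runs the block only if no other holder has the lock.
# Returns :already_running when the lock is taken.
def with_single_instance(path)
  File.open(path, File::RDWR | File::CREAT, 0o644) do |f|
    # LOCK_NB makes flock return false instead of waiting for the lock.
    return :already_running unless f.flock(File::LOCK_EX | File::LOCK_NB)
    yield
  end
end

# A second attempt made while the first still holds the lock is refused.
nested = with_single_instance(LOCK_PATH) do
  with_single_instance(LOCK_PATH) { :second_ran }
end

# Once the first holder has finished, a new run can start.
later = with_single_instance(LOCK_PATH) { :ran }
```

With a lock like this, a hung agent does not multiply, but it also keeps every later run from starting, which is why the hang takes the node offline.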

Changed in fuel:
milestone: none → 6.1
tags: added: module-astute
Changed in fuel:
milestone: 6.1 → 7.0
no longer affects: fuel/7.0.x
Mike Scherbakov (mihgen) wrote :

Vladimir, folks,
why can't we simply surround the code with
require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes more than 5 seconds...
}

Failure of one disk should not prevent the whole node from being discovered in Nailgun.
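
A minimal sketch of that idea, assuming the disk probe can be passed in as a block. The helper name disks_with_timeout and the simulated probes are hypothetical and stand in for the real disk detection. On timeout it returns an empty list so the rest of the node report can still be sent:

```ruby
require 'timeout'

# Hypothetical helper: runs the disk-listing block and gives up after
# `limit` seconds, returning an empty list instead of hanging.
def disks_with_timeout(limit = 5)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error
  []  # the disk list is lost, but the caller can carry on
end

# Simulated probes: one that stalls, one that answers immediately.
slow = disks_with_timeout(0.2) { sleep 2; ['sda'] }
fast = disks_with_timeout(0.2) { ['sda', 'sdb'] }
```

Here slow ends up empty and fast keeps both disks. Whether Timeout can interrupt a real read that is stuck on a broken device depends on how that read blocks, so this should be checked against the actual failure.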

tags: added: qa-agree-7.0 release-notes
Vladimir Sharshov (vsharshov) wrote :

+1 for Mike's solution.

But there is a small side effect: some disks will disappear from the web UI. I think this is a small price to pay.

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
assignee: Vladimir Sharshov (vsharshov) → nobody
assignee: nobody → Fuel Astute Team (fuel-astute)

Fix proposed to branch: master

Changed in fuel:
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
status: Confirmed → In Progress
Changed in fuel:
milestone: 7.0 → 6.1

Submitter: Jenkins
Branch: master

commit 689fdeddb8b08cf07300dd554603f65d495559f4
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Tue May 5 13:35:08 2015 +0300

    Prevent agent hangs if ohai does not return disks

    Instead of freeze we got all data without disks now.
    Current timeout - 30 sec.

    Co-Authored-By: Mike Scherbakov (mihgen) <email address hidden>
    Change-Id: I65d1b570cd01e12b521403c6d6e990043eb2c2ab
    Closes-Bug: #1396086

Changed in fuel:
status: In Progress → Fix Committed
tags: removed: qa-agree-7.0 release-notes
tags: added: on-verification
tags: removed: on-verification