fuel-agent should retry if loop device allocation failed

Bug #1506071 reported by Alexander Gordeev
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Andrey Tykhonov

Bug Description

If two or more instances of fuel-agent are started simultaneously, fuel-agent can fail on loop device allocation.

The failure scenario is simple:
1) fuel-agent finds the first free loop device. With multiple instances running, several of them may find the same loop device free.
2) fuel-agent allocates it. Once allocated, the device is busy. The other instances that found the same loop device free will then raise an exception, because one fuel-agent instance has already occupied it.
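
The two steps above are a check-then-act race. A minimal, deterministic sketch of that interleaving follows; the `LoopDevices` class and `simulate_race` function are hypothetical stand-ins made up for illustration (this is not fuel-agent code and it does not touch real loop devices):

```python
class LoopDevices:
    """Toy in-memory model of a set of loop devices (hypothetical)."""

    def __init__(self, names):
        # False means the device is free, True means it is busy.
        self.busy = {name: False for name in names}

    def find_free(self):
        """Return the first device that is currently free."""
        for name, is_busy in self.busy.items():
            if not is_busy:
                return name
        raise RuntimeError("no free loop device")

    def attach(self, name):
        """Mark the device busy, failing if someone else already took it."""
        if self.busy[name]:
            raise RuntimeError("%s is already in use" % name)
        self.busy[name] = True


def simulate_race():
    """Interleave two agents: both look for a device before either attaches."""
    devices = LoopDevices(["/dev/loop0", "/dev/loop1"])
    seen_by_a = devices.find_free()   # agent A, step 1
    seen_by_b = devices.find_free()   # agent B, step 1 (same answer as A)
    devices.attach(seen_by_a)         # agent A, step 2 succeeds
    try:
        devices.attach(seen_by_b)     # agent B, step 2 fails
    except RuntimeError as exc:
        return seen_by_a, seen_by_b, str(exc)
    return seen_by_a, seen_by_b, None
```

Both agents see the same device as free because nothing is reserved between the lookup and the attach, so the second attach fails.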

To fix this, a retry should be added to this piece of code: https://github.com/openstack/fuel-agent/blob/f78e4eba30254d6e7307d6cd6a4fbacccb1670c3/fuel_agent/manager.py#L611-L617

The fix is simple:
1) Find a free loop device.
2) Allocate it. If allocation fails with ProcessExecutionError, go back to 1) until the max retry count is reached.
The max retry count should be available as a config parameter.
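
A rough sketch of the proposed retry loop, not the actual patch. Assumptions: `ProcessExecutionError` here is a local stand-in for the error named above, `attach_to_free_loop_device` and its `find_free` / `attach` callables are hypothetical names, and `max_attempts` stands for the configurable retry count:

```python
class ProcessExecutionError(Exception):
    """Local stand-in for the error raised when an external command fails."""


def attach_to_free_loop_device(find_free, attach, image_path, max_attempts=10):
    """Find a free loop device and attach image_path to it, retrying on failure.

    find_free: callable returning the name of a free loop device.
    attach: callable(device, image_path) raising ProcessExecutionError
            when the device turns out to be busy.
    max_attempts: hypothetical name for the configurable retry limit.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        device = find_free()               # step 1: look up a free device again
        try:
            attach(device, image_path)     # step 2: try to allocate it
        except ProcessExecutionError as exc:
            last_error = exc               # lost the race; go back to step 1
            continue
        return device
    raise ProcessExecutionError(
        "no loop device attached after %d attempts: %s"
        % (max_attempts, last_error)
    )
```

The lookup is repeated on every attempt on purpose: after a failed attach the previously found device is known to be busy, so retrying the same device would not help.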

Changed in fuel:
status: New → Triaged
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Andrey Tykhonov (atykhonov)
description: updated
Changed in fuel:
status: Triaged → In Progress
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Dmitry Pyzhov (dpyzhov) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-agent (master)

Reviewed: https://review.openstack.org/237624
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=9e4b5ad8dc83e4d48359f6c165a5cc85f2916eac
Submitter: Jenkins
Branch: master

commit 9e4b5ad8dc83e4d48359f6c165a5cc85f2916eac
Author: Andrey Tykhonov <email address hidden>
Date: Tue Oct 20 18:00:38 2015 +0300

    Make several attempts to attach image file

    If one starts 2 or more instances of fuel-agent simultaneously, then
    fuel-agent could fail on loop device allocation. With this patch it
    makes several attempts to attach temporary image file to loop device.

    Change-Id: I502eb07a69a2d813157d7511fc03032671e98196
    Closes-Bug: #1506071

Changed in fuel:
status: In Progress → Fix Committed
Vladimir (vushakov) wrote :

The steps for verification/reproduction are unclear.
@a-gordeev could you please update the description with additional information on how the fix can be checked?

Alexander Gordeev (a-gordeev) wrote :

The bug was caught under very specific circumstances.

Two instances of mcollective were accidentally running instead of one, due to an error somewhere in fuel-main. That error has been fixed, but the bug it led to can be encountered only if 2 instances of fuel-agent are started simultaneously with extremely precise timing.

I can't suggest any 100% reliable steps to reproduce it. I doubt it can be reproduced with our regular deployment flow.

However, the committed fix comes with good unit test coverage, so I think it's safe to mark it as 'Fix Released' based on the committed code, without performing a full verification/reproduction cycle against the fix.

Vladimir (vushakov)
Changed in fuel:
status: Fix Committed → Fix Released