nova-network and nova-compute fails to start without logging an error when iptables locks are present

Bug #785955 reported by Anthony Young
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Mike Pittaro

Bug Description

It is possible for nova-network to fail in ways that leave an iptables lock file in /var/lock/nova. If a lock is present, nova-network cannot restart, and no error is written to the log file.
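
To illustrate the "no error in the logfile" part: a lock wait that has a deadline can at least report what it is stuck on. The following is only a sketch under assumptions, not nova's locking code. It uses an O_EXCL-style lock file instead of nova's own mechanism, and the names `acquire_with_timeout` and `LockTimeout`, the timeout values, and the temporary directory are all hypothetical.

```python
import logging
import os
import tempfile
import time

LOG = logging.getLogger("lock-demo")  # hypothetical logger name


class LockTimeout(Exception):
    """Raised when the lock file could not be created before the deadline."""


def acquire_with_timeout(lock_path, timeout=0.3, poll=0.05):
    """Create lock_path exclusively, or log an error and raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                LOG.error("Could not acquire %s within %.1fs; "
                          "a stale lock file may be present",
                          lock_path, timeout)
                raise LockTimeout(lock_path)
            time.sleep(poll)


# Simulate a lock file left behind by a crashed process.
lock_path = os.path.join(tempfile.mkdtemp(), "demo-iptables.lock")
open(lock_path, "w").close()

try:
    acquire_with_timeout(lock_path)
    timed_out = False
except LockTimeout:
    timed_out = True
```

With the stale file in place the call gives up after the timeout and leaves an error in the log, instead of waiting silently.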

Tags: canonistack
Thierry Carrez (ttx) wrote :

Are there any details on the failure scenario that could be used to reproduce it?

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Andrew Glen-Young (aglenyoung) wrote :

I have managed to trigger this bug on a compute node.

Unfortunately, I will have to use some conjecture when it comes to how the error occurred. As near as I can gather:

 1. Compute node receives a message to create a new instance.
 2. Firewall rules are being created for the instance. Meanwhile...
 3. Libvirtd crashes.
 4. Nova-compute crashes.
 5. Stale iptables lock file persists.

I have not (yet) verified the above scenario.

While the lock file persists, nova-compute does not seem to process any messages from the message queue. In fact, the entire process seems wedged waiting on the lock file.

Symptoms that led me to investigate:

 1. RabbitMQ queue for the compute node was non-zero
 2. Nova-compute was not running.

Resolution:

 1. Restart libvirtd
 2. Restart nova-compute
 3. Realise that nova-compute would not progress past running iptables for the first instance discovered. The log message is as follows:

2011-11-11 10:27:47,800 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "_do_refresh_provider_fw_rules"... from (pid=17684) inner /usr/lib/python2.7/dist-packages/nova/utils.py:680

 4. Strace the nova-compute process and discover that nova-compute is looping while waiting for a lock to be released. A sample of the strace follows:

stat("/var/lock/nova/compute.Dummy-1-17684", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
epoll_wait(4, {}, 1023, 99) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
link("/var/lock/nova/compute.Dummy-1-17684", "/var/lock/nova/nova-iptables.lock.lock") = -1 EEXIST (File exists)

 5. Stop nova-compute
 6. Verify that nova-compute is not running
 7. Remove the lock file
 8. Start nova-compute
 9. Verify that the service has started correctly.
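
The strace output in step 4 is consistent with a link-based file lock: the process has its own per-process file and tries to hard-link it to the shared lock name, and link() fails with EEXIST for as long as that name exists, whether or not its owner is still alive. A minimal sketch of that pattern follows. It assumes Python 3 and a temporary directory rather than /var/lock/nova, and the function name `try_link_lock` and the file names are made up for illustration.

```python
import errno
import os
import tempfile


def try_link_lock(lock_dir, unique_name, lock_name):
    """Try once to take a link-based lock; return True on success."""
    unique_path = os.path.join(lock_dir, unique_name)
    lock_path = os.path.join(lock_dir, lock_name)
    # Create the per-process file that will be linked to the lock name.
    open(unique_path, "a").close()
    try:
        os.link(unique_path, lock_path)
        return True
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            # Somebody (alive or not) already holds the lock name.
            return False
        raise


lock_dir = tempfile.mkdtemp()
# The first "process" takes the lock and then "crashes" without releasing it.
first = try_link_lock(lock_dir, "host.Dummy-1-1111", "demo-iptables.lock")
# A restarted process keeps failing for as long as the stale file remains.
second = try_link_lock(lock_dir, "host.Dummy-1-2222", "demo-iptables.lock")
print(first, second)
```

A caller that simply retries the second attempt in a loop would show the same repeating link() = -1 EEXIST pattern as the strace above, which is why removing the lock file in step 7 unblocks the service.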

summary: - nova-network fails to start without logging an error when iptables locks
- are present
+ nova-network and nova-compute fails to start without logging an error
+ when iptables locks are present
Boris Deschenes (boris-michel-deschenes) wrote :

I have this problem all the time. I'm on CentOS using the griddynamics packages.

On my side, I do not need to try to reproduce it: it happens all the time when qemu-* is at version 0.15.0. I basically get only half of my instances up and running before a race condition occurs (or, as explained in this thread, libvirt or nova-compute crashes). When it occurs, instances stop spawning and the system waits on a lock file that is never released.

This problem is not nearly as frequent with qemu 0.12.

Boris (praefect)

Mike Pittaro (mikeyp-3) wrote :

I have managed to trigger this running tempest - will attempt to narrow down the tests to something easily reproduced.

Changed in nova:
assignee: nobody → Mike Pittaro (mikeyp-3)
James Troup (elmo) wrote :

We just ran into this (again) with nova-compute from precise, FWIW.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/4516

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/4516
Committed: http://github.com/openstack/nova/commit/2fbccc0c693193533284330325f5803c8c6ce52a
Submitter: Jenkins
Branch: master

commit 2fbccc0c693193533284330325f5803c8c6ce52a
Author: Mike Pittaro <email address hidden>
Date: Fri Feb 24 09:56:26 2012 -0800

    Clean stale lockfiles on service startup : fixes bug 785955

    Adds cleanup_files_locks() to nova/utils, which cleans up stale locks
    left behind after process failures.

    Adds a call to clean up locks on service startup for nova-api, nova-cert,
    nova-compute, nova-network, nova-objectstore, and nova-scheduler.

    Adds tools/clean_file_locks.py, which can be used to manually clean
    stale locks.

    Change-Id: I752e0b24d3c7fc5f1dc290da355cbd7f430789b8
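
For readers who want a feel for what cleaning stale locks can involve, here is a rough sketch. It is not the code from the commit above. It assumes, based only on the strace earlier in this report, that per-process files end in "-<pid>" and are hard-linked to a shared "*.lock" name; the function names `pid_alive` and `clean_stale_locks` and the demo file names are hypothetical, and it is POSIX-only.

```python
import os
import re
import subprocess
import sys
import tempfile


def pid_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def clean_stale_locks(lock_dir):
    """Remove per-process files of dead pids, then locks nobody links to."""
    removed = []
    # Pass 1: per-process files are assumed to end in "-<pid>".
    for name in sorted(os.listdir(lock_dir)):
        match = re.search(r"-(\d+)$", name)
        if match and not pid_alive(int(match.group(1))):
            os.unlink(os.path.join(lock_dir, name))
            removed.append(name)
    # Pass 2: a lock file with a link count of 1 has no owner file left.
    for name in sorted(os.listdir(lock_dir)):
        path = os.path.join(lock_dir, name)
        if name.endswith(".lock") and os.stat(path).st_nlink == 1:
            os.unlink(path)
            removed.append(name)
    return removed


# Demo: use the pid of a child process that has already exited.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
lock_dir = tempfile.mkdtemp()
owner = os.path.join(lock_dir, "host.Dummy-1-%d" % child.pid)
open(owner, "w").close()
os.link(owner, os.path.join(lock_dir, "demo-iptables.lock"))

removed = clean_stale_locks(lock_dir)
```

A cleanup like this is only safe when nothing else can be taking locks at the same moment, which is presumably why the commit runs it at service startup.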

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → essex-4
status: Fix Committed → Fix Released
tags: added: canonistack
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-4 → 2012.1