OpenStack Compute (Nova)

nova-network and nova-compute fails to start without logging an error when iptables locks are present

Reported by Anthony Young on 2011-05-20
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Mike Pittaro

Bug Description

It is possible for nova-network to fail in ways that leaves an iptables lock file in /var/lock/nova. In the event that a lock is present, nova-network will not be able to restart, and no error will be provided in the logfile.
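The failure mode can be sketched with a toy version of a link()-based file lock, the locking scheme suggested by the strace output later in this report. This is illustrative code, not nova's implementation; the paths are stand-ins, and a max_wait escape hatch is added only so the demo terminates, where the real acquire loop would spin forever with nothing above DEBUG in the log:

```python
import os
import time

LOCK_DIR = "/tmp/nova-lock-demo"          # stands in for /var/lock/nova
LOCK_PATH = os.path.join(LOCK_DIR, "nova-iptables.lock.lock")

def acquire_link_lock(unique_path, lock_path, poll=0.1, max_wait=1.0):
    """Simplified link()-based lock acquire: loop until link() succeeds.

    A stale lock file left behind by a crashed process makes this loop
    spin indefinitely; max_wait exists only so this demo returns.
    """
    open(unique_path, "w").close()            # per-process unique file
    waited = 0.0
    while True:
        try:
            os.link(unique_path, lock_path)   # atomic; EEXIST if lock held
            return True
        except FileExistsError:
            if waited >= max_wait:
                return False                  # demo-only escape hatch
            time.sleep(poll)
            waited += poll

os.makedirs(LOCK_DIR, exist_ok=True)
# Simulate a stale lock left behind by a crashed process:
open(LOCK_PATH, "w").close()
got = acquire_link_lock(os.path.join(LOCK_DIR, "compute.demo"), LOCK_PATH)
print("acquired:", got)   # → acquired: False; the real service wedges here
```

Because nothing in the loop raises or logs an error, the service appears to start and then silently does no work.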

Thierry Carrez (ttx) wrote :

Any detail on the failure scenario that could be used to reproduce ?

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Andrew Glen-Young (aglenyoung) wrote :

I have managed to trigger this bug on a compute node.

Unfortunately, I will have to use some conjecture when it comes to how the error occurred. As near as I can gather:

 1. Compute node receives a message to create a new instance.
 2. Firewall rules are being created for the instance. Meanwhile...
 3. Libvirtd crashes.
 4. Nova-compute crashes.
 5. Stale iptables lock file persists.

I have not (yet) verified the above scenario.

While the lock file persists, nova-compute does not seem to process any messages from the message queue. In fact, the entire process seems wedged waiting on the lock file.

Symptoms that lead me to investigate:

 1. RabbitMQ queue depth for the compute node was non-zero.
 2. Nova-compute was not running.

Resolution:

 1. Restart libvirtd
 2. Restart nova-compute
 3. Realise that nova-compute would not proceed past running iptables for the first instance discovered. The log message is as follows:

2011-11-11 10:27:47,800 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "_do_refresh_provider_fw_rules"... from (pid=17684) inner /usr/lib/python2.7/dist-packages/nova/utils.py:680

 4. Strace the nova-compute process and discover that nova-compute is looping while waiting for a lock to be released. A sample of the strace follows:

stat("/var/lock/nova/compute.Dummy-1-17684", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
epoll_wait(4, {}, 1023, 99) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
epoll_wait(4, {}, 1023, 0) = 0
link("/var/lock/nova/compute.Dummy-1-17684", "/var/lock/nova/nova-iptables.lock.lock") = -1 EEXIST (File exists)

 5. Stop nova-compute
 6. Verify that nova-compute is not running
 7. Remove the lock file
 8. Start nova-compute
 9. Verify that the service has started correctly.
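The link()/EEXIST pattern in the strace at step 4 is characteristic of link()-based locking, where per-process lock files carry the owning pid in their name (e.g. compute.Dummy-1-17684). Before removing a lock by hand as in step 7, one can check whether that pid is still alive. The helper below is hypothetical, written only to illustrate the check; it is not nova code:

```python
import errno
import os
import re

def lock_is_stale(lock_name):
    """Return True if the pid embedded in a lock file name is gone.

    Assumes the "<name>-<pid>" convention seen in the strace above.
    """
    match = re.search(r"-(\d+)$", lock_name)
    if not match:
        return False                      # no pid in the name: cannot tell
    pid = int(match.group(1))
    try:
        os.kill(pid, 0)                   # signal 0: existence probe only
    except OSError as err:
        return err.errno == errno.ESRCH   # no such process -> stale
    return False                          # process alive -> lock is live

print(lock_is_stale("compute.Dummy-1-%d" % os.getpid()))   # own pid: False
```

Signal 0 delivers nothing; it only asks the kernel whether the pid exists, so the probe is safe to run against a live service.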

summary: - nova-network fails to start without logging an error when iptables locks
- are present
+ nova-network and nova-compute fails to start without logging an error
+ when iptables locks are present

Boris (praefect) wrote :

I have this problem all the time; I'm on CentOS using the griddynamics packages.

On my side, I do not need to try to reproduce it: it happens every time if I have qemu-* at version 0.15.0. I basically only get half of my instances up and running before a race condition occurs (or, as explained in this thread, something related to libvirt or nova-compute crashing). When it occurs, the node stops spawning instances; the system waits on a lock file that will never be released.

This problem is not nearly as frequent with qemu 0.12.

Mike Pittaro (mikeyp-3) wrote :

I have managed to trigger this running tempest - will attempt to narrow down the tests to something easily reproduced.

Changed in nova:
assignee: nobody → Mike Pittaro (mikeyp-3)
James Troup (elmo) wrote :

We just ran into this (again) with nova-compute from precise, FWIW.

Fix proposed to branch: master
Review: https://review.openstack.org/4516

Changed in nova:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/4516
Committed: http://github.com/openstack/nova/commit/2fbccc0c693193533284330325f5803c8c6ce52a
Submitter: Jenkins
Branch: master

commit 2fbccc0c693193533284330325f5803c8c6ce52a
Author: Mike Pittaro <email address hidden>
Date: Fri Feb 24 09:56:26 2012 -0800

    Clean stale lockfiles on service startup : fixes bug 785955

    Adds cleanup_files_locks() to nova/utils, which cleans up stale locks
    left behind after process failures.

    Adds a call to clean up locks on service startup for nova-api, nova-cert,
    nova-compute, nova-network, nova-objectstore, and nova-scheduler.

    Adds tools/clean_file_locks.py, which can be used to manually clean
    stale locks.

    Change-Id: I752e0b24d3c7fc5f1dc290da355cbd7f430789b8
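The approach the commit describes can be sketched as follows. The function name and the exact policy shown here (probe the pid embedded in each lock file name, unlink the file if the owner is gone) are assumptions for illustration, not the actual nova implementation:

```python
import errno
import os
import re
import tempfile

def cleanup_stale_locks(lock_dir):
    """Remove lock files whose owning process is no longer running.

    Illustrative sketch of a startup cleanup pass; assumes the
    "<name>-<pid>" lock naming seen in the strace in this report.
    """
    removed = []
    for name in os.listdir(lock_dir):
        match = re.search(r"-(\d+)$", name)
        if not match:
            continue                          # no pid in name: leave it
        try:
            os.kill(int(match.group(1)), 0)   # existence probe
        except OSError as err:
            if err.errno == errno.ESRCH:      # owner gone: stale lock
                os.unlink(os.path.join(lock_dir, name))
                removed.append(name)
        # owner alive (or EPERM): leave the lock alone
    return removed

# Demo: one lock owned by this process (kept), one with a dead pid (removed).
lock_dir = tempfile.mkdtemp()
open(os.path.join(lock_dir, "compute.Dummy-1-%d" % os.getpid()), "w").close()
open(os.path.join(lock_dir, "compute.Dummy-1-99999999"), "w").close()
removed = cleanup_stale_locks(lock_dir)
print(removed)
```

Running such a pass at service startup means a crash can no longer wedge the next start, which is the fix's intent.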

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-02-29
Changed in nova:
milestone: none → essex-4
status: Fix Committed → Fix Released
tags: added: canonistack
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-4 → 2012.1