Cannot open network namespace

Bug #1339045 reported by moorryan
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
neutron
Expired
Medium
Unassigned

Bug Description

While booting tripleo undercloud have occassionally (about 8 in 10 attempts) seen errors in neutron-dhcp-agent.log of:

2014-07-08 10:12:07.374 3905 ERROR neutron.agent.linux.utils [-]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-e967d587-5999-4333-8405-76b32398b121', 'ip', 'link', 'set', 'tap4ac4d6ab-d7', 'up']
Exit code: 1
Stdout: ''
Stderr: 'Cannot open network namespace "qdhcp-e967d587-5999-4333-8405-76b32398b121": No such file or directory\n'
2014-07-08 10:12:07.419 3905 ERROR neutron.agent.linux.utils [-]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-e967d587-5999-4333-8405-76b32398b121', 'ip', '-o', 'link', 'show', 'tap4ac4d6ab-d7']
Exit code: 1
Stdout: ''
Stderr: 'Cannot open network namespace "qdhcp-e967d587-5999-4333-8405-76b32398b121": No such file or directory\n'

In all cases these ERRORs are preceeded by the ERROR:

2014-07-08 10:12:07.151 3905 ERROR neutron.agent.dhcp_agent [req-523b2309-18c9-4a2d-ad65-7586db10f956 None] Failed reporting state!
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent Traceback (most recent call last):
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/dhcp_agent.py", line 576, in _report_state
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/rpc.py", line 70, in report_state
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return self.call(context, msg, topic=self.topic)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 161, in call
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent context, msg, rpc_method='call', **kwargs)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 185, in __call_rpc_method
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return func(context, msg['method'], **msg['args'])
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent wait_for_reply=True, timeout=timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent timeout=timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent result = self._waiter.wait(msg_id, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent reply, ending = self._poll_connection(msg_id, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent % msg_id)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent MessagingTimeout: Timed out waiting for a reply to message ID 21fd392ebb3a43228447f380334d05b0
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent
2014-07-08 10:12:07.157 3905 WARNING neutron.openstack.common.loopingcall [-] task run outlasted interval by 30.119771 sec

If I then wait for the undercloud to fail and login to the seed host, I can manually run the command that failed previously:
root@ubuntu:/var/log/upstart# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec 'qdhcp-e967d587-5999-4333-8405-76b32398b121' ip link set tap4ac4d6ab-d7 up

Replicate the ERROR message by attempting to exec a command for a non existent namespace..
------------------------------------------------------------------------------------------------------------------------------------
root@ubuntu:/var/log/upstart# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec 'qdhcp-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' ip link set tap4ac4d6ab-d7 up
Cannot open network namespace "qdhcp-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx": No such file or directory

The code section that returns the error (neutron/agent/linux/ip_lib.py - see below) has a comment regarding the return of a RuntimeError if the device does not exist - so this could be seen as expected behaviour
def ensure_device_is_ready(device_name, root_helper=None, namespace=None):
    dev = IPDevice(device_name, root_helper, namespace)
    try:
        # Ensure the device is up, even if it is already up. If the device
        # doesn't exist, a RuntimeError will be raised.
        dev.link.set_up()
    except RuntimeError:
        return False
    return True

 - why does the device not exist? (probably a timing issue)

 - why not wait until the device exists before attempting the operation?
    what check could be put in place - prior to calling this function (or as part of this function) that could check on the state of the device
    would a retry be appropriate?

 - Is the name of the function correct? - does not seem to 'ensure_device_is_ready' - more like "set_device_interface_up"?

Tags: l3-ipam-dhcp
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Ilya Shakhat (shakhat)
status: New → Confirmed
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Both error messages are expected. They occur during normal operation when namespace does not exist yet. The easiest way to repro is to create new network, subnet with dhcp and then launch instance plugged into that network.

The first error is written while calling to ensure_device_is_ready() (ip_lib.py #557). This function tries to set the interface up and returns True only if the call is succeed.
The second error happens inside plug() (agent.linux.interface #171) at check ip_lib.device_exists. This function checks device existance by retrieving its address.

Both functions expect RuntimeError if underlying shell command fails. However utils.execute (agent.linux.utils #61) writes error message on every failure. This behavior was changed in https://review.openstack.org/#/c/95902/:
"
Previously, the execute method in neutron logs everything as debug which hides
a lot of extremely fatal errors like unable to apply security group rules!
This patch changes this code so that we log all non 0 returns as error.
"

The proposed solution is to return debug-level logging in utils.execute and log errors in those callers that do not treat RuntimeError as normal operation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/111715

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/111715
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=539778601fa13c2d41be604d938bec0a496969c2
Submitter: Jenkins
Branch: master

commit 539778601fa13c2d41be604d938bec0a496969c2
Author: Ilya Shakhat <email address hidden>
Date: Mon Aug 4 16:46:27 2014 +0400

    Make log level in linux.utils.execute configurable

    Function linux.utils.execute always logs failures on error level.
    Some callers use this function to check resource existance and
    interpret failure as normal flow. Log level is made configurable
    for such cases.

    Change-Id: I8e4cd7b42234426f8fe58d353a64e898f02f8998
    Closes-bug: 1339045

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Maru Newby (maru) wrote :

I do not agree with the change that was merged to fix this. Why does execute() need yet another parameter? If a caller cares about a non-zero exit status, the 'check_exit_code' is how they indicate this and error logging should center around that.

Changed in neutron:
status: Fix Committed → In Progress
Changed in neutron:
assignee: Ilya Shakhat (shakhat) → nobody
status: In Progress → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 172 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.