While booting a tripleo undercloud I have occasionally (about 8 in 10 attempts) seen errors like the following in neutron-dhcp-agent.log:
2014-07-08 10:12:07.374 3905 ERROR neutron.agent.linux.utils [-]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-e967d587-5999-4333-8405-76b32398b121', 'ip', 'link', 'set', 'tap4ac4d6ab-d7', 'up']
Exit code: 1
Stdout: ''
Stderr: 'Cannot open network namespace "qdhcp-e967d587-5999-4333-8405-76b32398b121": No such file or directory\n'
2014-07-08 10:12:07.419 3905 ERROR neutron.agent.linux.utils [-]
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-e967d587-5999-4333-8405-76b32398b121', 'ip', '-o', 'link', 'show', 'tap4ac4d6ab-d7']
Exit code: 1
Stdout: ''
Stderr: 'Cannot open network namespace "qdhcp-e967d587-5999-4333-8405-76b32398b121": No such file or directory\n'
In all cases these ERRORs are preceded by:
2014-07-08 10:12:07.151 3905 ERROR neutron.agent.dhcp_agent [req-523b2309-18c9-4a2d-ad65-7586db10f956 None] Failed reporting state!
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent Traceback (most recent call last):
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/dhcp_agent.py", line 576, in _report_state
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/rpc.py", line 70, in report_state
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return self.call(context, msg, topic=self.topic)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 161, in call
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent context, msg, rpc_method='call', **kwargs)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 185, in __call_rpc_method
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return func(context, msg['method'], **msg['args'])
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent wait_for_reply=True, timeout=timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent timeout=timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent result = self._waiter.wait(msg_id, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent reply, ending = self._poll_connection(msg_id, timeout)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent % msg_id)
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent MessagingTimeout: Timed out waiting for a reply to message ID 21fd392ebb3a43228447f380334d05b0
2014-07-08 10:12:07.151 3905 TRACE neutron.agent.dhcp_agent
2014-07-08 10:12:07.157 3905 WARNING neutron.openstack.common.loopingcall [-] task run outlasted interval by 30.119771 sec
If I then wait for the undercloud to fail and log in to the seed host, I can manually run the command that previously failed:
root@ubuntu:/var/log/upstart# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec 'qdhcp-e967d587-5999-4333-8405-76b32398b121' ip link set tap4ac4d6ab-d7 up
The ERROR message can be replicated by attempting to exec a command in a non-existent namespace:
------------------------------------------------------------------------------------------------------------------------------------
root@ubuntu:/var/log/upstart# sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec 'qdhcp-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' ip link set tap4ac4d6ab-d7 up
Cannot open network namespace "qdhcp-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx": No such file or directory
The code that raises the error (ensure_device_is_ready in neutron/agent/linux/ip_lib.py, see below) has a comment noting that a RuntimeError is raised if the device does not exist, so this could be seen as expected behaviour:
def ensure_device_is_ready(device_name, root_helper=None, namespace=None):
    dev = IPDevice(device_name, root_helper, namespace)
    try:
        # Ensure the device is up, even if it is already up. If the device
        # doesn't exist, a RuntimeError will be raised.
        dev.link.set_up()
    except RuntimeError:
        return False
    return True
- Why does the device not exist? (probably a timing issue)
- Why not wait until the device exists before attempting the operation? What check could be put in place, prior to calling this function (or as part of it), on the state of the device? Would a retry be appropriate?
- Is the name of the function correct? It does not really "ensure_device_is_ready"; something like "set_device_interface_up" seems closer to what it does.
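As a sketch of the retry idea raised above, a small generic wrapper could poll a readiness check a few times before giving up. This is purely illustrative: the function name, retry count, and delay are assumptions, not part of neutron; the callable passed in would be something like a closure over ensure_device_is_ready().

```python
import time

def wait_until_ready(check, retries=5, delay=2.0):
    """Hypothetical retry helper: call check() up to `retries` times,
    sleeping `delay` seconds between attempts, and report whether the
    check ever succeeded. Intended usage would be e.g.
    wait_until_ready(lambda: ensure_device_is_ready(name, helper, ns)).
    """
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)
    return False
```

This keeps the timing policy out of ensure_device_is_ready() itself, so callers that expect the device to already exist are not slowed down.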
Both error messages are expected. They occur during normal operation when the namespace does not exist yet. The easiest way to reproduce is to create a new network and a subnet with DHCP, then launch an instance plugged into that network.
The first error is written while calling ensure_device_is_ready() (ip_lib.py, line 557). This function tries to set the interface up and returns True only if the call succeeds. The second error happens inside plug() (agent.linux.interface, line 171) at the ip_lib.device_exists check. That function checks device existence by retrieving its address.
Both functions expect a RuntimeError if the underlying shell command fails. However, utils.execute (agent.linux.utils, line 61) writes an error message on every failure. This behaviour was changed in https://review.openstack.org/#/c/95902/:
"
Previously, the execute method in neutron logs everything as debug which hides
a lot of extremely fatal errors like unable to apply security group rules!
This patch changes this code so that we log all non 0 returns as error.
"
The proposed solution is to restore debug-level logging in utils.execute and log errors only in those callers that do not treat RuntimeError as normal operation.
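A minimal sketch of that proposal: let the caller tell execute() whether a non-zero exit is genuinely an error. The log_fail_as_error flag name and this simplified implementation are assumptions for illustration, not the actual neutron code; callers such as ensure_device_is_ready(), which treat failure as a normal outcome, would pass log_fail_as_error=False.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)

def execute(cmd, log_fail_as_error=True):
    """Run cmd and return its stdout. On non-zero exit, log at ERROR
    only when the caller considers the failure unexpected, otherwise
    at DEBUG, then raise RuntimeError either way so existing callers
    keep their try/except flow.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        msg = ("Command %r failed: exit code %d, stderr %r"
               % (cmd, proc.returncode, err))
        if log_fail_as_error:
            LOG.error(msg)
        else:
            LOG.debug(msg)
        raise RuntimeError(msg)
    return out
```

With this shape, the scary-looking "Cannot open network namespace" messages from the expected-failure paths would drop to debug level, while unexpected failures (such as being unable to apply security group rules) would still be logged as errors.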