test_slaac_from_os frequent failures

Bug #1410982 reported by Matt Riedemann
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Miguel Angel Ajo
Milestone: 2015.1.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147308

Changed in neutron:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) wrote : Re: tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os blowing up hard in the gate
Changed in neutron:
importance: Undecided → Critical
Carl Baldwin (carl-baldwin) wrote :
Carl Baldwin (carl-baldwin) wrote :
Miguel Angel Ajo (mangelajo) wrote :

The failures were introduced by my patch
https://review.openstack.org/#/c/144977

The restart() method of DhcpLocalProcess calls disable(), then enable(), on
the same driver instance.

disable() removes all the config files via self._remove_config_files (L253);
enable() then crashes while trying to set the interface name, because the
network conf dir no longer exists.

I'm submitting a patch for the problem.
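
A minimal, self-contained sketch of the broken sequence (illustrative stand-ins only, not the actual neutron code in neutron/agent/linux/dhcp.py):

    import os
    import shutil
    import tempfile


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the regression, not the real driver."""

        def __init__(self):
            # After the offending patch, network_conf_dir is created only
            # here, once, when the driver is instantiated.
            self.network_conf_dir = tempfile.mkdtemp()

        def disable(self):
            # _remove_config_files() wipes the whole config directory.
            shutil.rmtree(self.network_conf_dir)

        def enable(self):
            # Writing the interface file now fails: nothing re-creates
            # network_conf_dir between disable() and enable().
            path = os.path.join(self.network_conf_dir, 'interface')
            with open(path, 'w') as f:
                f.write('tapfd5a26b6-8e')

        def restart(self):
            self.disable()
            self.enable()   # raises IOError/OSError: directory is gone


    DhcpLocalProcessSketch().restart()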

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147347

Changed in neutron:
assignee: Matt Riedemann (mriedem) → Miguel Angel Ajo (mangelajo)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/147347
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=718b2950616fc748881f9f644bc7000961886c65
Submitter: Jenkins
Branch: master

commit 718b2950616fc748881f9f644bc7000961886c65
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Jan 15 00:23:29 2015 +0000

    Call on dhcp-agent DhcpLocalProcess.restart() breaks dhcp

    The bugfix implemented on Ib37651f7f802debd472ab292b148c2a2496063a3
    creates the network_conf_dir only during __init__, and restart()
    method calls disable(), then enable(), where disable will remove
    the config directory and all configuration files.

    Previous implementation re-created the network_conf_dir on the
    interface_name setter (called from enable) with ensure_conf_dir=True.

    Change-Id: I6afaea062c9eed5c63fc64fb0113f460b2cf35c5
    Closes-Bug: 1410982
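
In other words, the fix re-creates the directory in the interface_name setter before writing, as the old ensure_conf_dir=True path did. A hedged sketch of the approach (illustrative stand-in, not the merged diff):

    import os


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the fixed behaviour."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        @property
        def interface_name(self):
            try:
                with open(os.path.join(self.network_conf_dir, 'interface')) as f:
                    return f.read().strip()
            except IOError:
                return None

        @interface_name.setter
        def interface_name(self, value):
            # Re-create the conf dir if disable() removed it, restoring
            # the pre-regression ensure_conf_dir behaviour.
            if not os.path.isdir(self.network_conf_dir):
                os.makedirs(self.network_conf_dir, 0o755)
            with open(os.path.join(self.network_conf_dir, 'interface'), 'w') as f:
                f.write(value)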

Changed in neutron:
status: In Progress → Fix Committed
Armando Migliaccio (armando-migliaccio) wrote : Re: tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os blowing up hard in the gate
Changed in neutron:
status: Fix Committed → Confirmed
summary: - tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os
- blowing up hard in the gate
+ test_slaac_from_os frequent failures
Armando Migliaccio (armando-migliaccio) wrote :

Other root causes need to be determined.

Changed in neutron:
importance: Critical → High
Carl Baldwin (carl-baldwin) wrote :

The logstash query from comment #3 still shows 'Cannot find device "None"', which started with the patch mentioned in the same comment.

Miguel Angel Ajo (mangelajo) wrote :

I believe now we're back to what I found in the original bug:

https://bugs.launchpad.net/neutron/+bug/1407618 (look at the kibana link in there)

reload_allocations is being called without the config directory having been properly initialized.

I believe there is a restart() call into the driver that goes wrong (exceptions.Conflict)
because the dhcp_port can't be allocated or retrieved from the plugin; that's captured and printed here [1].

Then reload_allocations is called afterwards, but everything is in an inconsistent state
(dnsmasq hasn't been started by the restart), and reload_allocations fails [2].

[1] http://logs.openstack.org/89/147289/2/check/check-tempest-dsvm-neutron-full/827799c/logs/screen-q-dhcp.txt.gz#_2015-01-15_16_48_10_653
[2] http://logs.openstack.org/89/147289/2/check/check-tempest-dsvm-neutron-full/827799c/logs/screen-q-dhcp.txt.gz?#_2015-01-15_16_48_10_656

So now we're back to the initial state, but with a more accurate diagnosis. Now
we need to understand why the plugin fails to return a dhcp_port, or why we fail
to retrieve it.

[centos@devstack neutron]$ grep Conflict -C4 -n agent/linux/dhcp.py
910- dhcp_port = self.plugin.update_dhcp_port(
911- port.id, {'port': {'network_id': network.id,
912- 'fixed_ips': port_fixed_ips}})
913- if not dhcp_port:
914: raise exceptions.Conflict()
915- else:
916- dhcp_port = port
917- # break since we found port that matches device_id
918- break
--
945- fixed_ips=[dict(subnet_id=s) for s in dhcp_enabled_subnet_ids])
946- dhcp_port = self.plugin.create_dhcp_port({'port': port_dict})
947-
948- if not dhcp_port:
949: raise exceptions.Conflict()
950-
951- # Convert subnet_id to subnet dict
952- fixed_ips = [dict(subnet_id=fixed_ip.subnet_id,
953- ip_address=fixed_ip.ip_address,

I will be looking at that tonight.
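
For reference, the agent-side dispatch that captures the Conflict (visible in [1]) looks roughly like this; a simplified sketch of call_driver() from neutron/agent/dhcp_agent.py, which in reality does more:

    import logging

    LOG = logging.getLogger(__name__)


    def call_driver(driver, action):
        # Sketch: dispatch 'enable'/'disable'/'restart'/'reload_allocations'
        # by name; a failure (e.g. exceptions.Conflict when the dhcp port
        # can't be created because the subnet is gone) is logged, not
        # re-raised, so the agent carries on to the next action.
        try:
            getattr(driver, action)()
            return True
        except Exception:
            LOG.exception('Unable to %s dhcp.', action)
            return False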

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/147308
Reason: This is not the root cause for the reported bug.

Brian Haley (brian-haley) wrote :

Miguel - in the logstash pointer from your comment above I see this:

1. "Good" call into the namespace to list routes

2015-01-15 16:48:10.167 2660 DEBUG neutron.agent.linux.utils [req-f4ad1fc1-c36d-4598-b721-296a7abeee4a None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'tapfd5a26b6-8e']
Exit code: 0

2. Restart action, which is going to kill dnsmasq calling "disable":

2015-01-15 16:48:10.237 2660 DEBUG neutron.agent.dhcp_agent [req-820e24fc-a7ee-440e-b9c9-80653cd4c8d6 None] Calling driver for network: 5566b546-a31b-40ae-913f-2a72ca2ce0ad action: restart call_driver /opt/stack/new/neutron/neutron/agent/dhcp_agent.py:129
2015-01-15 16:48:10.237 2660 DEBUG neutron.agent.linux.utils [req-820e24fc-a7ee-440e-b9c9-80653cd4c8d6 None] Running command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '14952'] create_process /opt/stack/new/neutron/neutron/agent/linux/utils.py:46

2a. A side-effect of this call to disable is:

  self._remove_config_files()

That will remove the "interface" file from the directory.

3. The next use of self.interface_name is going to try to pull the value from the file that was just removed, and will instead return None, which leads to this:

2015-01-15 16:48:11.045 2660 ERROR neutron.agent.linux.utils [req-b4899e34-1bfc-4d5d-882f-449307039c83 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'None']
Exit code: 1

So it seems as though, if we are retaining the port (I believe retain_port=True is passed), we don't want to remove that interface file, since it will still be the correct interface.

Does that make sense?
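
If that reading is right, the fix would be to have disable() honor retain_port for the interface file too; something like this sketch (illustrative stand-in with simplified file handling, not the real driver):

    import os
    import shutil


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in, not the real neutron driver."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        def disable(self, retain_port=False):
            # When the port is retained (as on restart()), the tap device
            # stays plugged in, so keep the 'interface' file that records
            # its name; deleting it is what makes the next read of
            # self.interface_name come back as None.
            for name in os.listdir(self.network_conf_dir):
                if retain_port and name == 'interface':
                    continue
                os.remove(os.path.join(self.network_conf_dir, name))
            if not retain_port:
                shutil.rmtree(self.network_conf_dir, ignore_errors=True)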

Miguel Angel Ajo (mangelajo) wrote :

The higher failure rate was introduced by my patch https://review.openstack.org/#/c/144977,
and corrected by https://review.openstack.org/147347.

But there is a baseline of errors that was happening before, and it keeps happening.

I have found that the errors like:

Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'None']
Exit code: 1

Still show up in the logs, even for fully passing tempest runs. I believe it's basically a
harmless race condition between neutron-server and the neutron-dhcp-agent when a
subnet gets deleted while a restart() has been triggered for some other reason:
1) the restart happens; 2) the subnet has been deleted, so the restart fails with Conflict;
3) reload_allocations runs, but since the restart didn't succeed it finds everything in an
inconsistent state. We really don't care, though: the subnet is deleted, and we were never
going to get a dhcp port on a network that is gone.

We could, by the way, make reload_allocations fail more gracefully, and avoid having such
transient/harmless failures confuse our log inspection. I will propose a patch in that direction.
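
For instance, something along these lines (a hypothetical sketch of the direction, not the eventual patch):

    import logging
    import os

    LOG = logging.getLogger(__name__)


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the graceful-failure idea."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        @property
        def active(self):
            # Stand-in for the real liveness check (pid file + process).
            return os.path.exists(os.path.join(self.network_conf_dir, 'pid'))

        def reload_allocations(self):
            # If a prior restart() failed (e.g. Conflict because the subnet
            # and its dhcp port are already gone), dnsmasq never came back
            # up; skip quietly instead of erroring on a half-torn-down state.
            if not self.active:
                LOG.debug('Skipping reload_allocations: dnsmasq not active; '
                          'the network was probably deleted mid-restart.')
                return
            self._reload_dnsmasq()

        def _reload_dnsmasq(self):
            pass   # stand-in for sending SIGHUP to dnsmasq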

Miguel Angel Ajo (mangelajo) wrote :

The baseline errors seem to have gone away, and when I look into errors
related to test_slaac_from_os they are paired with broader network problems,
not anything specific to that test.

I'm not sure if this bug is obsolete now, or if we need to dig deeper into the logs.

Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → nobody
Changed in neutron:
importance: High → Medium
Armando Migliaccio (armando-migliaccio) wrote :
Armando Migliaccio (armando-migliaccio) wrote :

The fix for the root cause that led to the spike has merged here:

https://review.openstack.org/147347

Even though we coincidentally fail with the same failure mode, the root cause is different, and it should probably be tracked (if we find a good soul) in a different bug.

Changed in neutron:
status: Confirmed → Fix Committed
assignee: nobody → Miguel Angel Ajo (mangelajo)
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-3 → 2015.1.0