test_slaac_from_os frequent failures

Bug #1410982 reported by Matt Riedemann
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Miguel Angel Ajo
Milestone: 2015.1.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147308

Changed in neutron:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) wrote : Re: tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os blowing up hard in the gate
Changed in neutron:
importance: Undecided → Critical
Carl Baldwin (carl-baldwin) wrote :
Carl Baldwin (carl-baldwin) wrote :
Miguel Angel Ajo (mangelajo) wrote :

The failures were introduced by my patch
https://review.openstack.org/#/c/144977

The restart() method of DhcpLocalProcess calls disable(), then enable(), on
the same driver instance.

disable() removes all the config files via self._remove_config_files (L253);
enable() then crashes while trying to set the interface name, because the
network conf dir no longer exists.

I'm submitting a patch for the problem.
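
A minimal, self-contained sketch of the broken sequence (illustrative stand-ins only, not the actual neutron code in neutron/agent/linux/dhcp.py):

    import os
    import shutil
    import tempfile


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the regression, not the real driver."""

        def __init__(self):
            # After the offending patch, network_conf_dir is created only
            # here, once, when the driver is instantiated.
            self.network_conf_dir = tempfile.mkdtemp()

        def disable(self):
            # _remove_config_files() wipes the whole config directory.
            shutil.rmtree(self.network_conf_dir)

        def enable(self):
            # Writing the interface file now fails: nothing re-creates
            # network_conf_dir between disable() and enable().
            path = os.path.join(self.network_conf_dir, 'interface')
            with open(path, 'w') as f:
                f.write('tapfd5a26b6-8e')

        def restart(self):
            self.disable()
            self.enable()   # raises IOError/OSError: directory is gone


    DhcpLocalProcessSketch().restart()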

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147347

Changed in neutron:
assignee: Matt Riedemann (mriedem) → Miguel Angel Ajo (mangelajo)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/147347
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=718b2950616fc748881f9f644bc7000961886c65
Submitter: Jenkins
Branch: master

commit 718b2950616fc748881f9f644bc7000961886c65
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Jan 15 00:23:29 2015 +0000

    Call on dhcp-agent DhcpLocalProcess.restart() breaks dhcp

    The bugfix implemented on Ib37651f7f802debd472ab292b148c2a2496063a3
    creates the network_conf_dir only during __init__, and restart()
    method calls disable(), then enable(), where disable will remove
    the config directory and all configuration files.

    Previous implementation re-created the network_conf_dir on the
    interface_name setter (called from enable) with ensure_conf_dir=True.

    Change-Id: I6afaea062c9eed5c63fc64fb0113f460b2cf35c5
    Closes-Bug: 1410982
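
In other words, the fix re-creates the directory in the interface_name setter before writing, as the old ensure_conf_dir=True path did. A hedged sketch of the approach (illustrative stand-in, not the merged diff):

    import os


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the fixed behaviour."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        @property
        def interface_name(self):
            try:
                with open(os.path.join(self.network_conf_dir, 'interface')) as f:
                    return f.read().strip()
            except IOError:
                return None

        @interface_name.setter
        def interface_name(self, value):
            # Re-create the conf dir if disable() removed it, restoring
            # the pre-regression ensure_conf_dir behaviour.
            if not os.path.isdir(self.network_conf_dir):
                os.makedirs(self.network_conf_dir, 0o755)
            with open(os.path.join(self.network_conf_dir, 'interface'), 'w') as f:
                f.write(value)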

Changed in neutron:
status: In Progress → Fix Committed
Armando Migliaccio (armando-migliaccio) wrote : Re: tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os blowing up hard in the gate
Changed in neutron:
status: Fix Committed → Confirmed
summary: - tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os
- blowing up hard in the gate
+ test_slaac_from_os frequent failures
Armando Migliaccio (armando-migliaccio) wrote :

Other root causes need to be determined.

Changed in neutron:
importance: Critical → High
Carl Baldwin (carl-baldwin) wrote :

The logstash query from comment #3 still shows 'Cannot find device "None"', which started with the patch mentioned in the same comment.

Miguel Angel Ajo (mangelajo) wrote :

I believe now we're back to what I found in the original bug:

https://bugs.launchpad.net/neutron/+bug/1407618 (look at the kibana link in there)

reload_allocations is being called without the config directory having been properly initialized.

I believe there is a restart() call into the driver that goes wrong (exceptions.Conflict)
because the dhcp_port can't be allocated or retrieved from the plugin; that's captured and printed here [1].

Then reload_allocations is called afterwards, but everything is in an inconsistent state
(dnsmasq hasn't been started by the restart), and reload_allocations fails [2].

[1] http://logs.openstack.org/89/147289/2/check/check-tempest-dsvm-neutron-full/827799c/logs/screen-q-dhcp.txt.gz#_2015-01-15_16_48_10_653
[2] http://logs.openstack.org/89/147289/2/check/check-tempest-dsvm-neutron-full/827799c/logs/screen-q-dhcp.txt.gz?#_2015-01-15_16_48_10_656

So now we're back to the initial state, but with a more accurate diagnosis. Now
we need to understand why the plugin fails to return a dhcp_port, or why we fail
to retrieve it.

[centos@devstack neutron]$ grep Conflict -C4 -n agent/linux/dhcp.py
910- dhcp_port = self.plugin.update_dhcp_port(
911- port.id, {'port': {'network_id': network.id,
912- 'fixed_ips': port_fixed_ips}})
913- if not dhcp_port:
914: raise exceptions.Conflict()
915- else:
916- dhcp_port = port
917- # break since we found port that matches device_id
918- break
--
945- fixed_ips=[dict(subnet_id=s) for s in dhcp_enabled_subnet_ids])
946- dhcp_port = self.plugin.create_dhcp_port({'port': port_dict})
947-
948- if not dhcp_port:
949: raise exceptions.Conflict()
950-
951- # Convert subnet_id to subnet dict
952- fixed_ips = [dict(subnet_id=fixed_ip.subnet_id,
953- ip_address=fixed_ip.ip_address,

I will be looking at that tonight.
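
For reference, the agent-side dispatch that captures the Conflict (visible in [1]) looks roughly like this; a simplified sketch of call_driver() from neutron/agent/dhcp_agent.py, which in reality does more:

    import logging

    LOG = logging.getLogger(__name__)


    def call_driver(driver, action):
        # Sketch: dispatch 'enable'/'disable'/'restart'/'reload_allocations'
        # by name; a failure (e.g. exceptions.Conflict when the dhcp port
        # can't be created because the subnet is gone) is logged, not
        # re-raised, so the agent carries on to the next action.
        try:
            getattr(driver, action)()
            return True
        except Exception:
            LOG.exception('Unable to %s dhcp.', action)
            return False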

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/147308
Reason: This is not the root cause for the reported bug.

Brian Haley (brian-haley) wrote :

Miguel - in the logstash pointer from your comment above I see this:

1. "Good" call into the namespace to list routes

2015-01-15 16:48:10.167 2660 DEBUG neutron.agent.linux.utils [req-f4ad1fc1-c36d-4598-b721-296a7abeee4a None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'tapfd5a26b6-8e']
Exit code: 0

2. Restart action, which is going to kill dnsmasq calling "disable":

2015-01-15 16:48:10.237 2660 DEBUG neutron.agent.dhcp_agent [req-820e24fc-a7ee-440e-b9c9-80653cd4c8d6 None] Calling driver for network: 5566b546-a31b-40ae-913f-2a72ca2ce0ad action: restart call_driver /opt/stack/new/neutron/neutron/agent/dhcp_agent.py:129
2015-01-15 16:48:10.237 2660 DEBUG neutron.agent.linux.utils [req-820e24fc-a7ee-440e-b9c9-80653cd4c8d6 None] Running command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '14952'] create_process /opt/stack/new/neutron/neutron/agent/linux/utils.py:46

2a. A side-effect of this call to disable is:

  self._remove_config_files()

That will remove the "interface" file from the directory.

3. The next use of self.interface_name is going to try to pull the value from the file that was just removed, and will instead return None, which leads to this:

2015-01-15 16:48:11.045 2660 ERROR neutron.agent.linux.utils [req-b4899e34-1bfc-4d5d-882f-449307039c83 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'None']
Exit code: 1

So it seems as though, if we are retaining the port (I believe retain_port=True is passed), we don't want to remove that interface file, since it will still be the correct interface.

Does that make sense?
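
If that reading is right, the fix would be to have disable() honor retain_port for the interface file too; something like this sketch (illustrative stand-in with simplified file handling, not the real driver):

    import os
    import shutil


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in, not the real neutron driver."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        def disable(self, retain_port=False):
            # When the port is retained (as on restart()), the tap device
            # stays plugged in, so keep the 'interface' file that records
            # its name; deleting it is what makes the next read of
            # self.interface_name come back as None.
            for name in os.listdir(self.network_conf_dir):
                if retain_port and name == 'interface':
                    continue
                os.remove(os.path.join(self.network_conf_dir, name))
            if not retain_port:
                shutil.rmtree(self.network_conf_dir, ignore_errors=True)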

Miguel Angel Ajo (mangelajo) wrote :

The higher failure rate was introduced by my patch https://review.openstack.org/#/c/144977,
and corrected by https://review.openstack.org/147347.

But there is a baseline of errors that was happening before, and it keeps happening.

I have found that the errors like:

Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-5566b546-a31b-40ae-913f-2a72ca2ce0ad', 'ip', 'route', 'list', 'dev', 'None']
Exit code: 1

Still show up in the logs, even for fully passing tempest runs. I believe it's basically a
harmless race condition between neutron-server and the neutron-dhcp-agent when a
subnet gets deleted while a restart() has been triggered for some other reason:
1) the restart happens; 2) the subnet has been deleted, so the restart fails with Conflict;
3) reload_allocations runs, but since the restart didn't succeed it finds everything in an
inconsistent state. We really don't care, though: the subnet is deleted, and we were never
going to get a dhcp port on a network that is gone.

We could, by the way, make reload_allocations fail more gracefully, and avoid having such
transient/harmless failures confuse our log inspection. I will propose a patch in that direction.
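
For instance, something along these lines (a hypothetical sketch of the direction, not the eventual patch):

    import logging
    import os

    LOG = logging.getLogger(__name__)


    class DhcpLocalProcessSketch(object):
        """Illustrative stand-in for the graceful-failure idea."""

        def __init__(self, network_conf_dir):
            self.network_conf_dir = network_conf_dir

        @property
        def active(self):
            # Stand-in for the real liveness check (pid file + process).
            return os.path.exists(os.path.join(self.network_conf_dir, 'pid'))

        def reload_allocations(self):
            # If a prior restart() failed (e.g. Conflict because the subnet
            # and its dhcp port are already gone), dnsmasq never came back
            # up; skip quietly instead of erroring on a half-torn-down state.
            if not self.active:
                LOG.debug('Skipping reload_allocations: dnsmasq not active; '
                          'the network was probably deleted mid-restart.')
                return
            self._reload_dnsmasq()

        def _reload_dnsmasq(self):
            pass   # stand-in for sending SIGHUP to dnsmasq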

Miguel Angel Ajo (mangelajo) wrote :

The baseline errors seem to have gone away, and when I look into errors
related to test_slaac_from_os they are paired with broader network problems,
not anything specific to that test.

I'm not sure if this bug is obsolete now, or if we need to dig deeper into the logs.

Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → nobody
Changed in neutron:
importance: High → Medium
Armando Migliaccio (armando-migliaccio) wrote :
Armando Migliaccio (armando-migliaccio) wrote :

The fix for the root cause that led to the spike has merged here:

https://review.openstack.org/147347

Even though we coincidentally fail with the same failure mode, the root cause is different, and it should probably be tracked (if we find a good soul) in a different bug.

Changed in neutron:
status: Confirmed → Fix Committed
assignee: nobody → Miguel Angel Ajo (mangelajo)
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-3 → 2015.1.0