Delete Router / race condition

Bug #1635554 reported by Bernhard
This bug affects 2 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: John Schwarz

Bug Description

When deleting a router, the log file fills up.

CentOS 7
Newton (RDO)

2016-10-21 09:45:02.526 16200 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:140
2016-10-21 09:45:02.526 16200 WARNING neutron.agent.l3.namespaces [-] Namespace qrouter-8cf22225-5c5c-461c-84f3-c8abeca8f79a does not exist. Skipping delete
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent [-] Error while deleting router 8cf22225-5c5c-461c-84f3-c8abeca8f79a
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 357, in _safe_router_removed
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent self._router_removed(router_id)
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 376, in _router_removed
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent ri.delete(self)
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 381, in delete
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent self.destroy_state_change_monitor(self.process_monitor)
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 325, in destroy_state_change_monitor
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent pm = self._get_state_change_monitor_process_manager()
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 296, in _get_state_change_monitor_process_manager
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent default_cmd_callback=self._get_state_change_monitor_callback())
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 299, in _get_state_change_monitor_callback
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent ha_device = self.get_ha_device_name()
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 137, in get_ha_device_name
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent return (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent TypeError: 'NoneType' object has no attribute '__getitem__'
2016-10-21 09:45:02.527 16200 ERROR neutron.agent.l3.agent
2016-10-21 09:45:02.528 16200 DEBUG neutron.agent.l3.agent [-] Finished a router update for 8cf22225-5c5c-461c-84f3-c8abeca8f79a _process_router_update /usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py:504

See full log: http://paste.openstack.org/show/586656/
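The TypeError at the end of the trace is Python 2's error for subscripting None: self.ha_port was never set, i.e. the router is being torn down before its HA port was processed. A minimal sketch of the failing frame (illustrative only, not the actual neutron code; the prefix and length values are assumptions):

    # Illustrative sketch, not neutron code; the HA_DEV_PREFIX and
    # DEV_NAME_LEN values are assumptions.
    HA_DEV_PREFIX = 'ha-'
    DEV_NAME_LEN = 14

    ha_port = None  # the delete raced ahead of the HA port update

    # Equivalent of get_ha_device_name():
    #   (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
    name = (HA_DEV_PREFIX + ha_port['id'])[:DEV_NAME_LEN]
    # Python 2: TypeError: 'NoneType' object has no attribute '__getitem__'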

LIU Yulong (dragon889) wrote :

Thanks Bernhard. Could you please add more logs so we can check whether the following line [1] was triggered?

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha_router.py#L81

Changed in neutron:
status: New → Incomplete
Changed in neutron:
status: Incomplete → New
tags: added: l3-ha
Bernhard (b-krieger) wrote :
Bernhard (b-krieger) wrote :
John Schwarz (jschwarz) wrote :

This seems a lot like https://bugs.launchpad.net/neutron/+bug/1606844, but perhaps that bug's patches didn't fix all the cases.

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → John Schwarz (jschwarz)
importance: Medium → High
John Schwarz (jschwarz)
tags: added: l3-dvr-backlog
LIU Yulong (dragon889) wrote :

Hi Bernhard, your pasted log only covers 2016-10-24, but this bug was filed on 2016-10-21 and the bug description is from that day. The log tar file should cover that day. :)

LIU Yulong (dragon889) wrote :

@John, I just abandoned the HA infinite-loop patch; it seems it needs to be restored again.
https://review.openstack.org/#/c/265672/

Bernhard (b-krieger) wrote :
John Schwarz (jschwarz) wrote :

Bernhard, sorry to be a nag, but could you also provide neutron-server logs from the controller(s)? The error you posted at the first link shows a symptom on the l3 agents, but the real cause is usually a race condition on the controller.

Bernhard (b-krieger) wrote :

On the control node the debug/verbose option wasn't set to true, so I triggered a new occurrence of the race condition.

John Schwarz (jschwarz) wrote :

Looking at the server logs, I can't see any errors or stack traces relating to this issue on the server side. Looking a bit deeper into the network node logs, it seems there are both l3-agent.log and vpn-agent.log, which implies that each network node has two different agents: the l3-agent and another l3-agent that specifically has the vpnaas code enabled. Is this accurate?

If this is indeed the case, then (afaik) the vpnaas agent is supposed to replace the pre-existing l3-agent (so there should be only one of [l3-agent, vpn-agent] on a node at any given moment). This is because the vpn-agent inherits most of the functionality the l3-agent provides and adds to it. This also explains the error: the l3-agent.log shows that it succeeds in deleting the namespace, and then the vpnaas-agent also tries to do that (and fails).

In other words, the vpnaas agent should replace the l3-agent (and should provide everything that was removed anyway, because it's running that code). If there are two instances, removing one of them should immediately mitigate the issue you're encountering.

If this isn't the case, please provide 'neutron agent-list' output and write down exactly what you're doing to reproduce this (I just tried on upstream master and it works perfectly with a single l3-agent on each node). A simplified sketch of the agent relationship follows.
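
To illustrate (a simplified sketch; these are not the exact vpnaas class names):

    # Simplified sketch (class names are illustrative): the vpn-agent
    # is an l3-agent subclass, so both react to the same router events.
    class L3NATAgent(object):
        def _router_removed(self, router_id):
            # tears down the qrouter namespace, HA state monitor, etc.
            pass

    class VPNAgent(L3NATAgent):
        # Inherits all l3-agent behaviour and adds the vpnaas drivers.
        # It is meant to replace the plain l3-agent on a node, not run
        # beside it; with two instances, both process the same
        # router-delete and the slower one fails as shown above.
        pass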

Bernhard (b-krieger) wrote :

Indeed, I found different l3-agents on both nodes.
I reconfigured the nodes so that the vpn-agent runs on both of them.
I deleted the agents and added them again.
Now it looks OK, but the error still occurs.
I uploaded new log files.

[root@control]# neutron agent-list
+--------------------------------------+---------------------------+----------------+-------------------+-------+----------------+---------------------------+
| id | agent_type | host | availability_zone | alive | admin_state_up | binary |
+--------------------------------------+---------------------------+----------------+-------------------+-------+----------------+---------------------------+
| 05ba5aba-6b5c-4ebe-8226-acb3fd4f8887 | BGP dynamic routing agent | network2.cloud | | :-) | True | neutron-bgp-dragent |
| 0f0aac4f-7e66-4c49-b4bb-e8ce9ad0b70f | Open vSwitch agent | compute0-cloud | | :-) | True | neutron-openvswitch-agent |
| 233f7379-e53b-49ed-aa19-e7d6cec7d9b9 | L3 agent | network3.cloud | nova | :-) | True | neutron-l3-agent |
| 279a52db-e61c-4765-bf64-69b4fcad8e8d | Metering agent | network3.cloud | | :-) | True | neutron-metering-agent |
| 3676b17b-6c0a-4d27-afde-44d4556cf3ee | Open vSwitch agent | network2.cloud | | :-) | True | neutron-openvswitch-agent |
| 45ac0503-3662-49ee-bc3a-3ca0153f592a | DHCP agent | network3.cloud | nova | :-) | True | neutron-dhcp-agent |
| 6ff9bdf1-45b6-4ea9-aa93-210d739c3995 | Open vSwitch agent | network3.cloud | | :-) | True | neutron-openvswitch-agent |
| 7ab48210-2c6e-43e8-9f4a-42f4b2dcdba3 | Metadata agent | network3.cloud | | :-) | True | neutron-metadata-agent |
| 86c48b50-f76d-4ed0-88f8-ac5588797fb4 | Metadata agent | network2.cloud | | :-) | True | neutron-metadata-agent |
| 9ed54ae7-e1df-469c-803a-6dcd83b34ad6 | Open vSwitch agent | compute1-cloud | | :-) | True | neutron-openvswitch-agent |
| bb4101e9-bc67-4c48-a4d8-ce6334850f4b | BGP dynamic routing agent | network3.cloud | | :-) | True | neutron-bgp-dragent |
| d0b05ebe-195b-4ebe-b15c-1390b5a13bfa | L3 agent | network2.cloud | nova | :-) | True | neutron-l3-agent |
| f31c737c-ea6a-46ab-9cfe-db43c7daa357 | Metering agent | network2.cloud | | :-) | True | neutron-metering-agent |
| ffc59751-f904-4011-b56b-5e6d86504a4f | DHCP agent | network2.cloud | nova | :-) | True | neutron-dhcp-agent |
+--------------------------------------+---------------------------+----------------+-------------------+-------+----------------+---------------------------+

John Schwarz (jschwarz) wrote :

I don't have access to a deployment with vpnaas (and this doesn't reproduce with a normal l3-agent on the master branch). Could you please create a new HA router and run:
1. 'ls -la /var/lib/neutron/ha_confs/<router_id>' (where <router_id> is the uuid of the router that was created in the previous step)
2. 'ip netns | grep <router_id>' (likewise)
and provide the outputs?

Also, please provide the relevant log files from the creation of the router (vpn agents and controllers).

John.

John Schwarz (jschwarz) wrote :

Marking Incomplete until more data is supplied on this bug.

Changed in neutron:
status: New → Incomplete
Bernhard (b-krieger) wrote :

I rechecked the configuration.
I noticed that the l3-agent and vpn-agent were running together on both network nodes :(
After disabling the l3-agent, leaving only the vpn-agent enabled, everything works fine.

Thank you and sorry...

bernhard

John Schwarz (jschwarz) wrote :

No worries :) Glad we could help.

Changed in neutron:
status: Incomplete → Invalid
importance: High → Undecided
Gaëtan Trellu (goldyfruit) wrote :

Hi guys,

We are having the same issue, and we are only running the VPN agent.
We are using Mitaka with Neutron 8.4.0.

The L3 agent has never been started.

2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent [-] Error while deleting router 93eb813c-a991-4061-a836-6a15fee82e3e
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 359, in _safe_router_removed
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent self._router_removed(router_id)
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 377, in _router_removed
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent ri.delete(self)
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 397, in delete
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent self.destroy_state_change_monitor(self.process_monitor)
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 344, in destroy_state_change_monitor
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent pm = self._get_state_change_monitor_process_manager()
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 315, in _get_state_change_monitor_process_manager
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent default_cmd_callback=self._get_state_change_monitor_callback())
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 318, in _get_state_change_monitor_callback
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent ha_device = self.get_ha_device_name()
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 156, in get_ha_device_name
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent return (HA_DEV_PREFIX + self.ha_port['id'])[:self.driver.DEV_NAME_LEN]
2017-05-22 14:24:03.357 7 ERROR neutron.agent.l3.agent TypeError: 'NoneType' object has no attribute '__getitem__'

Hua Zhang (zhhuabj) wrote :

Hi guys,

We are having the same issue, and we are only running the l3 agent.
We are using Mitaka with Neutron 8.4.0.

In addition to the same errors shown in the bug description, we also saw the error 'Error while deleting router xxxxxxxxxxxxxxxx' 3343386 times, coming from _safe_router_removed() [1]:

$ grep -r 'Error while deleting router c0dab368-5ac8-4996-88c9-f5d345a774a6' |wc -l
3343386

I did some brief analysis:

This _safe_router_removed() should be the one invoked at L488 [2]; I have pasted both methods below for convenience:

    def _process_router_update(self):
        ...
        router = update.router
        ...
        if not router:
            removed = self._safe_router_removed(update.id)
            if not removed:
                self._resync_router(update)

    def _safe_router_removed(self, router_id):
        try:
            self._router_removed(router_id)
        except Exception:
            LOG.exception(_LE('Error while deleting router %s'), router_id)
            return False
        else:
            return True

If _safe_router_removed() fails, it returns False, and then self._resync_router(update) [3] causes _safe_router_removed() to be run again and again.

    def _resync_router(self, router_update,
                       priority=queue.PRIORITY_SYNC_ROUTERS_TASK):
        router_update.timestamp = timeutils.utcnow()
        router_update.priority = priority
        router_update.router = None # Force the agent to resync the router
        self._queue.add(router_update)

This is why we saw the 'Error while deleting router XXXXX' error so many times.

[1] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L361
[2] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L488
[3] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L457
[4] https://github.com/openstack/neutron/blob/mitaka-eol/neutron/agent/l3/agent.py#L487
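
To illustrate, a guard along these lines (just a sketch, an assumption rather than the actual upstream fix) would let ri.delete() finish instead of raising and being re-queued forever:

    # Sketch only (not the actual fix): if the HA port was never set,
    # there is no state change monitor to tear down, so return early
    # instead of letting get_ha_device_name() raise on ha_port['id'].
    def destroy_state_change_monitor(self, process_monitor):
        if self.ha_port is None:
            LOG.debug('No HA port on router %s, skipping state change '
                      'monitor teardown', self.router_id)
            return
        pm = self._get_state_change_monitor_process_manager()
        process_monitor.unregister(self.router_id, 'ip_monitor')
        pm.disable()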

Hua Zhang (zhhuabj)
Changed in neutron:
status: Invalid → Confirmed
Brian Haley (brian-haley) wrote :

Mitaka is EOL; can you reproduce this on Ocata or the master branch?

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

We should probably close this bug and open a new one if this is seen on master/Pike or Ocata.

Changed in neutron:
status: Confirmed → Incomplete
Hua Zhang (zhhuabj) wrote :

Just to note, the problem can't be reproduced on Ocata or the master branch because it has been fixed as part of bug https://bugs.launchpad.net/neutron/+bug/1668410

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

In that case we should close this bug.

Changed in neutron:
status: Incomplete → Invalid