Neutron server unable to sync HA info after race between HA router creating and deleting

Bug #1533457 reported by LIU Yulong
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: LIU Yulong
Milestone: (none listed)

Bug Description

The Neutron server is unable to sync HA router data after a race between get_ha_router_port_bindings and an HA router delete API call.

Exception:
File "neutron/db/db_base_plugin_v2.py", line 921, in _make_port_dict
    res = {"id": port["id"],
TypeError: 'NoneType' object has no attribute '__getitem__'

Trace:
http://paste.openstack.org/show/473839/

The new trace:
neutron server side:
http://paste.openstack.org/show/489511/
l3 agent side:
http://paste.openstack.org/show/489509/
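
The failure mode can be illustrated without Neutron. The sketch below is hypothetical: `Binding`, `make_port_dict` and `sync_ha_data_unfiltered` are stand-ins invented for illustration, not Neutron code. They only mirror the shape of the problem, where a binding's port is None because the port was deleted by a concurrent router delete.

```python
class Binding(object):
    """Hypothetical stand-in for an HA router port binding row."""

    def __init__(self, router_id, port):
        self.router_id = router_id
        self.port = port  # None once the HA port has been deleted


def make_port_dict(port):
    """Simplified stand-in for _make_port_dict: it subscripts the port."""
    return {"id": port["id"]}


def sync_ha_data_unfiltered(bindings):
    """Build interface info for every binding, with no check for None."""
    return {b.router_id: make_port_dict(b.port) for b in bindings}


bindings = [
    Binding("router-a", {"id": "port-1"}),
    Binding("router-b", None),  # port removed by a concurrent router delete
]

try:
    sync_ha_data_unfiltered(bindings)
    failed = False
except TypeError:
    # Subscripting None raises TypeError; the message text differs
    # between Python 2 and Python 3.
    failed = True
```

A single binding with a None port is enough to abort the whole sync in this sketch, which matches the symptom described above.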

Revision history for this message
ugvddm (271025598-9) wrote :

Please provide more information, such as logs or steps to reproduce it. Thanks.

Changed in neutron:
status: New → Incomplete
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: Incomplete → In Progress
LIU Yulong (dragon889) wrote :

@ugvddm (271025598-9), thank you. For more information, please see:
https://bugs.launchpad.net/neutron/+bug/1523780
and the patch is here:
https://review.openstack.org/#/c/265680/

tags: added: kilo-backport-potential liberty-backport-potential
LIU Yulong (dragon889)
tags: added: l3-ha
LIU Yulong (dragon889) wrote :
Kevin Benton (kevinbenton) wrote :

Please include the full stack traces in your bug reports. This by itself is not helpful to reproduce or even understand the issue.

LIU Yulong (dragon889)
description: updated
LIU Yulong (dragon889)
description: updated
description: updated
Changed in neutron:
importance: Undecided → Medium
John Schwarz (jschwarz) wrote :

Latest developments: this has been re-encountered in https://bugs.launchpad.net/neutron/+bug/1605546. That bug has been marked as a duplicate of this one, and a patch (https://review.openstack.org/#/c/265680/) that should resolve this issue is being worked on.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/348215

John Schwarz (jschwarz)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/265680
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=179b8301edad50f999417f52b77092a496fb448e
Submitter: Jenkins
Branch: master

commit 179b8301edad50f999417f52b77092a496fb448e
Author: LIU Yulong <email address hidden>
Date: Mon Jan 11 11:31:36 2016 +0800

    Filter HA router without HA port bindings after race conditions

    Neutron server will not be able to sync ha router data after
    race happened between get_ha_router_port_bindings and HA router
    deleting API call. Once the ports of L3HARouterAgentPortBinding
    were deleted the _process_sync_ha_data may get a None binding
    port, and then the _process_sync_ha_data will fail to get the
    HA interface port info due to the None port. This patch will
    filter the bindings without port.

    Change-Id: Ie38baf061d678fc5d768195b25241efbad74e42f
    Closes-Bug: #1533457
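
The filtering described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the merged patch: `Binding`, `filter_bound_ports` and `build_ha_interfaces` are hypothetical names, and the real change operates on L3HARouterAgentPortBinding rows inside Neutron.

```python
from collections import namedtuple

# Hypothetical stand-in for a binding row; "port" is None once the
# HA port has been deleted by a concurrent router delete.
Binding = namedtuple("Binding", ["router_id", "port"])


def filter_bound_ports(bindings):
    """Keep only the bindings that still reference a port."""
    return [b for b in bindings if b.port is not None]


def build_ha_interfaces(bindings):
    """Map router id to HA interface info, skipping port-less bindings."""
    return {
        b.router_id: {"id": b.port["id"]}
        for b in filter_bound_ports(bindings)
    }


bindings = [
    Binding("router-a", {"id": "port-1"}),
    Binding("router-b", None),  # left behind by the race
]

ha_interfaces = build_ha_interfaces(bindings)
```

In this sketch the router whose port has disappeared is simply left out of the result, so the remaining routers can still be synced.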

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/349238

Bernhard (b-krieger) wrote :

After applying the patch, I still get the exceptions.

from vpn-agent.log:

2016-08-01 09:42:51.600 53157 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-a9806b97-0850-4f40-8c73-c1fac666d21b', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2016-08-01 09:42:51.613 53157 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:142
2016-08-01 09:42:51.613 53157 DEBUG oslo_concurrency.lockutils [-] Lock "l3-agent-pd" acquired by "neutron.agent.linux.pd.sync_router" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2016-08-01 09:42:51.613 53157 DEBUG oslo_concurrency.lockutils [-] Lock "l3-agent-pd" released by "neutron.agent.linux.pd.sync_router" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2016-08-01 09:42:51.614 53157 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-a9806b97-0850-4f40-8c73-c1fac666d21b', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2016-08-01 09:42:51.625 53157 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:142
2016-08-01 09:42:51.626 53157 DEBUG neutron.agent.l3.router_info [-] Terminating radvd daemon in router device: a9806b97-0850-4f40-8c73-c1fac666d21b disable_radvd /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:446
2016-08-01 09:42:51.626 53157 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a9806b97-0850-4f40-8c73-c1fac666d21b.pid.radvd get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:225
2016-08-01 09:42:51.627 53157 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/a9806b97-0850-4f40-8c73-c1fac666d21b.pid.radvd get_value_from_file /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:225
2016-08-01 09:42:51.627 53157 DEBUG neutron.agent.linux.external_process [-] No process started for a9806b97-0850-4f40-8c73-c1fac666d21b disable /usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py:118
2016-08-01 09:42:51.627 53157 DEBUG neutron.agent.linux.ra [-] radvd disabled for router a9806b97-0850-4f40-8c73-c1fac666d21b disable /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:190
2016-08-01 09:42:51.627 53157 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-a9806b97-0850-4f40-8c73-c1fac666d21b', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2016-08-01 09:42:51.632 53157 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-a9806b97-0850-4f40-8c73-c1fac666d21b": No such file or dir...

Bernhard (b-krieger) wrote :
LIU Yulong (dragon889) wrote :

@Bernhard (b-krieger), hi. According to the log you added, that issue was handled in:
https://bugs.launchpad.net/neutron/+bug/1607381
https://review.openstack.org/#/c/265672/

Also, John and I discussed this in:
https://review.openstack.org/#/c/265672/10//COMMIT_MSG
The L3 agent can still receive an HA router without an ha_port.

John Schwarz (jschwarz) wrote :

@Bernhard, it seems like you have stumbled into https://bugs.launchpad.net/neutron/+bug/1606844 which deals with the specific issue you've encountered.

However, the issue is triggered by some error on the server side. Please check the neutron-server logs for errors at roughly that time and provide any traces, warnings, or errors that appear there.

John.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/349238
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=55a35196f9b9029c8e8428b75ec7f7cf0d13cff5
Submitter: Jenkins
Branch: stable/mitaka

commit 55a35196f9b9029c8e8428b75ec7f7cf0d13cff5
Author: LIU Yulong <email address hidden>
Date: Mon Jan 11 11:31:36 2016 +0800

    Filter HA router without HA port bindings after race conditions

    Neutron server will not be able to sync ha router data after
    race happened between get_ha_router_port_bindings and HA router
    deleting API call. Once the ports of L3HARouterAgentPortBinding
    were deleted the _process_sync_ha_data may get a None binding
    port, and then the _process_sync_ha_data will fail to get the
    HA interface port info due to the None port. This patch will
    filter the bindings without port.

    Change-Id: Ie38baf061d678fc5d768195b25241efbad74e42f
    Closes-Bug: #1533457
    (cherry picked from commit 179b8301edad50f999417f52b77092a496fb448e)

tags: added: in-stable-mitaka
Bernhard (b-krieger) wrote :

I added this patch and the patch for bug 1606844 (as John mentioned).
I still hit the race condition when I delete a router.

http://paste.openstack.org/show/565708/

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/348215
Reason: This isn't necessary any more

tags: removed: kilo-backport-potential liberty-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
