test_ha_router fails intermittently

Bug #1499647 reported by Ann Taraday
This bug affects 6 people
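+---------+--------------+------------+--------------+-----------+
| Affects | Status       | Importance | Assigned to  | Milestone |
+---------+--------------+------------+--------------+-----------+
| neutron | Fix Released | Medium     | John Schwarz |           |
| Kilo    | Fix Released | Undecided  | Unassigned   |           |
+---------+--------------+------------+--------------+-----------+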

Bug Description

I tested L3 HA on an environment with 3 controllers and 1 compute node (Kilo), keepalived v1.2.13. I created 50 networks with 50 subnets and 50 routers, with an interface set for each subnet (note: I've seen the same errors with just one router and one network). I got the following errors:

root@node-6:~# neutron l3-agent-list-hosting-router router-1
Request Failed: internal server error while processing your request.

In neutron-server error log: http://paste.openstack.org/show/473760/

To continue testing, I changed _get_agents_dict_for_router to skip None, and was then able to see:

root@node-6:~# neutron l3-agent-list-hosting-router router-1
+--------------------------------------+-------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+-------------------+----------------+-------+----------+
| f3baba98-ef5d-41f8-8c74-a91b7016ba62 | node-6.domain.tld | True | :-) | active |
| c9159f09-34d4-404f-b46c-a8c18df677f3 | node-7.domain.tld | True | :-) | standby |
| b458ab49-c294-4bdb-91bf-ae375d87ff20 | node-8.domain.tld | True | :-) | standby |
| f3baba98-ef5d-41f8-8c74-a91b7016ba62 | node-6.domain.tld | True | :-) | active |
+--------------------------------------+-------------------+----------------+-------+----------+

root@node-6:~# neutron port-list --device_id=fcf150c0-f690-4265-974d-8db370e345c4
+--------------------------------------+-------------------------------------------------+-------------------+----------------------------------------------------------------------------------------+
| id | name | mac_address | fixed_ips |
+--------------------------------------+-------------------------------------------------+-------------------+----------------------------------------------------------------------------------------+
| 0834f8a2-f109-4060-9312-edebac84aba5 | | fa:16:3e:73:9f:33 | {"subnet_id": "0c7a2cfa-1cfd-4ecc-a196-ab9e97139352", "ip_address": "172.18.161.223"} |
| 2b5a7a15-98a2-4ff1-9128-67d098fa3439 | HA port tenant aef8d13bad9d42df9f25d8ee54c80ad6 | fa:16:3e:b8:f6:35 | {"subnet_id": "1915ccb8-9d0f-4f1a-9811-9a196d1e495e", "ip_address": "169.254.192.149"} |
| 48c887c1-acc3-4804-a993-b99060fa2c75 | HA port tenant aef8d13bad9d42df9f25d8ee54c80ad6 | fa:16:3e:e7:70:13 | {"subnet_id": "1915ccb8-9d0f-4f1a-9811-9a196d1e495e", "ip_address": "169.254.192.151"} |
| 82ab62d6-7dd1-4294-a0dc-f5ebfbcbb4ca | | fa:16:3e:c6:fc:74 | {"subnet_id": "c4cc21c9-3b3a-407c-b4a7-b22f783377e7", "ip_address": "10.0.40.1"} |
| bbca8575-51f1-4b42-b074-96e15aeda420 | HA port tenant aef8d13bad9d42df9f25d8ee54c80ad6 | fa:16:3e:84:4c:fc | {"subnet_id": "1915ccb8-9d0f-4f1a-9811-9a196d1e495e", "ip_address": "169.254.192.150"} |
| bee5c6d4-7e0a-4510-bb19-2ef9d60b9faf | HA port tenant aef8d13bad9d42df9f25d8ee54c80ad6 | fa:16:3e:09:a1:ae | {"subnet_id": "1915ccb8-9d0f-4f1a-9811-9a196d1e495e", "ip_address": "169.254.193.11"} |
| f8945a1d-b359-4c36-a8f8-e78c1ba992f0 | HA port tenant aef8d13bad9d42df9f25d8ee54c80ad6 | fa:16:3e:c4:54:b5 | {"subnet_id": "1915ccb8-9d0f-4f1a-9811-9a196d1e495e", "ip_address": "169.254.193.12"} |
+--------------------------------------+-------------------------------------------------+-------------------+----------------------------------------------------------------------------------------+
mysql root@192.168.0.2:neutron> SELECT * FROM ha_router_agent_port_bindings WHERE router_id='fcf150c0-f690-4265-974d-8db370e345c4';
+--------------------------------------+--------------------------------------+--------------------------------------+---------+
| port_id | router_id | l3_agent_id | state |
|--------------------------------------+--------------------------------------+--------------------------------------+---------|
| 2b5a7a15-98a2-4ff1-9128-67d098fa3439 | fcf150c0-f690-4265-974d-8db370e345c4 | c9159f09-34d4-404f-b46c-a8c18df677f3 | standby |
| 48c887c1-acc3-4804-a993-b99060fa2c75 | fcf150c0-f690-4265-974d-8db370e345c4 | b458ab49-c294-4bdb-91bf-ae375d87ff20 | standby |
| bbca8575-51f1-4b42-b074-96e15aeda420 | fcf150c0-f690-4265-974d-8db370e345c4 | <null> | standby |
| bee5c6d4-7e0a-4510-bb19-2ef9d60b9faf | fcf150c0-f690-4265-974d-8db370e345c4 | f3baba98-ef5d-41f8-8c74-a91b7016ba62 | active |
| f8945a1d-b359-4c36-a8f8-e78c1ba992f0 | fcf150c0-f690-4265-974d-8db370e345c4 | f3baba98-ef5d-41f8-8c74-a91b7016ba62 | active |
+--------------------------------------+--------------------------------------+--------------------------------------+---------+

So an extra L3HARouterAgentPortBinding was created for the router. This issue does not reproduce every time.
During sync_routers the following errors in logs appeared:

http://paste.openstack.org/show/473839/
http://paste.openstack.org/show/473840/
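
The two symptoms visible in the query output above (a binding with no agent, and an agent bound twice) can be checked for mechanically. The following is a minimal sketch, not Neutron code: the rows are copied from the table above, and `audit_bindings` is a hypothetical helper name introduced only for illustration.

```python
from collections import Counter

# Rows copied from the ha_router_agent_port_bindings query above, all for
# router fcf150c0-f690-4265-974d-8db370e345c4: (port_id, l3_agent_id, state).
# None stands for <null>.
BINDINGS = [
    ("2b5a7a15-98a2-4ff1-9128-67d098fa3439", "c9159f09-34d4-404f-b46c-a8c18df677f3", "standby"),
    ("48c887c1-acc3-4804-a993-b99060fa2c75", "b458ab49-c294-4bdb-91bf-ae375d87ff20", "standby"),
    ("bbca8575-51f1-4b42-b074-96e15aeda420", None, "standby"),
    ("bee5c6d4-7e0a-4510-bb19-2ef9d60b9faf", "f3baba98-ef5d-41f8-8c74-a91b7016ba62", "active"),
    ("f8945a1d-b359-4c36-a8f8-e78c1ba992f0", "f3baba98-ef5d-41f8-8c74-a91b7016ba62", "active"),
]


def audit_bindings(bindings):
    """Return (ports bound to no agent, agents bound more than once)."""
    unbound = [port for port, agent, _state in bindings if agent is None]
    counts = Counter(agent for _port, agent, _state in bindings if agent is not None)
    duplicated = sorted(agent for agent, n in counts.items() if n > 1)
    return unbound, duplicated


unbound_ports, duplicated_agents = audit_bindings(BINDINGS)
print("unbound ports:", unbound_ports)
print("duplicated agents:", duplicated_agents)
```

For this data the helper reports one unbound port and one agent (the one on node-6) that appears in two bindings.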

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Changed in neutron:
assignee: nobody → Ann Kamyshnikova (akamyshnikova)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/227821

Revision history for this message
Assaf Muller (amuller) wrote : Re: L3 HA: extra L3HARouterAgentPortBinding created for routers

L3HARouterAgentPortBinding is added via a single method: add_ha_port (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L312). That method is used in two places: in _create_ha_interfaces during HA router creation (https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L398), and in the L3 agent scheduler, in auto_schedule_router, _schedule_ha_routers_to_additional_agent, here: https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L150.

The race cannot happen between two create_routers for the same router, and it's not likely it's happening between two auto_schedule_router calls for the same router (That is invoked by sync_routers, which is an RPC method invoked by the L3 agent). So, that leaves a race between create_router and an agent invoking sync_routers on the server.

Looking at create_router in the HA routers mixin: https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L378. It's clearly not atomic at all. I think that after the base DB object is created in line 386, if an RPC call comes in from an agent (say, it just started or restarted, or an error occurred and it's resyncing), sync_routers will see a router object in the DB and try to bind it to the agent. Basically, I think that an HA router can be bound after the super(L3_HA_NAT_db_mixin, self).create_router(context, router) call in line 378 but before the self._create_ha_interfaces(context, router_db, ha_network) call in line 398. I verified this by putting a breakpoint right after the super create_router call, restarting an L3 agent, and hitting continue in pdb. After that, when trying to list the router bindings for that router, I got the trace described in the bug report.
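
The interleaving described in this comment can be reproduced in a toy model. This is only a sketch under stated assumptions: the in-memory `routers` and `bindings` containers stand in for the database tables, `MAX_L3_AGENTS_PER_ROUTER = 3` is an assumed value, and the two functions are simplified stand-ins that share only their names with the Neutron methods. The events force the sync to run inside the window between the two creation steps, much like the breakpoint did.

```python
import threading

MAX_L3_AGENTS_PER_ROUTER = 3  # assumed value for this toy model

routers = set()   # stands in for the routers table
bindings = []     # stands in for ha_router_agent_port_bindings
lock = threading.Lock()
router_row_created = threading.Event()
sync_finished = threading.Event()


def create_router(router_id):
    """Non-atomic create: the router row is visible before its bindings exist."""
    with lock:
        routers.add(router_id)
    router_row_created.set()
    # Window between the two steps (where the pdb breakpoint was placed).
    sync_finished.wait(timeout=5)
    with lock:
        for _ in range(MAX_L3_AGENTS_PER_ROUTER):
            bindings.append((router_id, None))  # agent is filled in later


def sync_routers(agent_id):
    """An agent resync that binds every router it can see."""
    router_row_created.wait(timeout=5)
    with lock:
        for router_id in routers:
            bindings.append((router_id, agent_id))
    sync_finished.set()


creator = threading.Thread(target=create_router, args=("router-1",))
syncer = threading.Thread(target=sync_routers, args=("agent-a",))
creator.start()
syncer.start()
creator.join()
syncer.join()

print(len(bindings), "bindings for a limit of", MAX_L3_AGENTS_PER_ROUTER)
```

In this model the router ends up with one binding more than the limit, because the create path does not take into account the binding the sync already added.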

Ann, Eugene - Thoughts on how to solve this issue? One way is to modify the proposed patch (keeping the new unique constraint), but in _create_ha_port_binding catch the unique constraint violation and return the existing binding instead of raising an exception (i.e. changing _create_ha_port_binding to _create_or_get_ha_port_binding).
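
The create-or-get idea can be sketched as follows. This is an illustration only, not the proposed patch: it uses the standard-library sqlite3 module in place of MySQL and SQLAlchemy, the table is reduced to three columns, and `create_or_get_ha_port_binding` is a free function named after the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ha_router_agent_port_bindings (
           port_id TEXT PRIMARY KEY,
           router_id TEXT NOT NULL,
           l3_agent_id TEXT,
           UNIQUE (router_id, l3_agent_id))"""
)


def create_or_get_ha_port_binding(conn, port_id, router_id, agent_id):
    """Insert a binding, or return the port_id of the one that already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ha_router_agent_port_bindings VALUES (?, ?, ?)",
                (port_id, router_id, agent_id),
            )
        return port_id
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT port_id FROM ha_router_agent_port_bindings "
            "WHERE router_id = ? AND l3_agent_id = ?",
            (router_id, agent_id),
        ).fetchone()
        if row is None:
            raise  # the violation was not the (router, agent) constraint
        return row[0]


first = create_or_get_ha_port_binding(conn, "port-1", "router-1", "agent-a")
second = create_or_get_ha_port_binding(conn, "port-2", "router-1", "agent-a")
row_count = conn.execute(
    "SELECT COUNT(*) FROM ha_router_agent_port_bindings"
).fetchone()[0]
print(first, second, row_count)
```

The second call hits the unique constraint and returns the binding created by the first call, so only one row exists for the (router, agent) pair.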

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

@Assaf Muller

Thanks a lot for your research! It helped me a lot with debugging.

About your suggestion:

I see that there are 2 problems: an extra L3HARouterAgentPortBinding with l3_agent_id=None, and duplicate bindings with the same l3_agent_id.

When I checked further, I realized that _create_ha_port_binding https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L304 creates L3HARouterAgentPortBinding with l3_agent_id=None http://paste.openstack.org/show/474814/, so a UniqueConstraint won't help here. The only way to prevent this error is to check that the number of HA port bindings already created for the router is less than max_l3_agents_per_router. This should be the solution for the first problem, but to resolve the second one we will need to add a UniqueConstraint to prevent updating an L3HARouterAgentPortBinding with a duplicate.
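
One way to see why a unique constraint does not stop bindings with l3_agent_id=None is that SQL treats NULL values as distinct from each other in unique constraints. The sketch below shows this with the standard-library sqlite3 module (an assumption standing in for the MySQL backend), together with the count check described above. The table layout, the helper name `add_unbound_binding`, and the limit of 3 are illustrative assumptions.

```python
import sqlite3

MAX_L3_AGENTS_PER_ROUTER = 3  # assumed value for illustration

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE bindings (
           port_id TEXT PRIMARY KEY,
           router_id TEXT NOT NULL,
           l3_agent_id TEXT,
           UNIQUE (router_id, l3_agent_id))"""
)


def add_unbound_binding(conn, port_id, router_id):
    """Add a binding with l3_agent_id=NULL unless the router is already full."""
    with conn:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM bindings WHERE router_id = ?", (router_id,)
        ).fetchone()
        if count >= MAX_L3_AGENTS_PER_ROUTER:
            return False
        conn.execute(
            "INSERT INTO bindings VALUES (?, ?, NULL)", (port_id, router_id)
        )
        return True


# Five attempts for the same router: the unique constraint never fires for
# the NULL rows, so only the count check limits them.
results = [add_unbound_binding(conn, f"port-{i}", "router-1") for i in range(5)]
print(results)
```

Note that in a multi-process server a count check like this is only safe if the check and the insert cannot interleave with another writer, which is the atomicity question discussed in the following comments.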

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Assaf Muller (amuller) wrote :

You're right that during the create_router flow we add bindings with l3_agent=None, and only populate that field later during router scheduling. The other place you can add bindings (with l3_agent already populated) is the router RPC sync, as stated earlier. In this case, what I think is happening is that the RPC sync call adds a binding (with l3_agent already populated), then the create_router flow adds one too many bindings. In this case, the unique constraint addition is not needed. You can't have (right?) two racing RPC calls coming from the same agent. The only issue we have to solve is to make sure that create_router adds the appropriate number of bindings; it cannot assume that bindings don't already exist.

Having said all that, I'm not sure if this solution is even correct. What if the create_router flow added the base router DB object, then the RPC sync call comes in. At this point, an HA router doesn't exist, and the router's VRID is not set either. The RPC call will add a binding, and when it tries to create an HA port it will fail because the HA network doesn't exist yet. Maybe the create_router created the HA network in time but didn't set the VRID yet. In this case the RPC call will most likely succeed, but the agent will fail to configure the router because the VRID field is empty. This is ugly!

The simpler and more robust solution is to make the HA router create_router method atomic, put everything apart from the notification in a transaction. The issue here is that we use the core plugin to create ports and networks, and those calls can involve HTTP and RPC calls.

I'm not sure what is the right solution here. Thoughts?

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

I agree with your analysis.

The strange thing: in my tests, adding a transaction helped only when it covered everything in create_router, even the notification. When I put them in separately, I still got an extra port created... I tested multiple times. I will update the change with a new version.

Revision history for this message
Assaf Muller (amuller) wrote :

Wow, that is weird. Any idea how that could happen? I don't understand how putting the notification in or out of the transaction can affect the results of your test. For correctness' sake I think you must commit first and then notify the agent, otherwise you risk the agent querying about a router that doesn't even exist yet in the DB. Also, putting an RPC method in the transaction violates the cardinal rule of keeping transactions as short as possible.
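
The commit-then-notify ordering can be illustrated with two database connections, one playing the server and one playing the agent. This is a sketch under assumptions: SQLite (via the standard-library sqlite3 module and a temporary file) stands in for the real database, so its isolation behavior is only an analogy, and `create_router` and `agent_sees` are simplified hypothetical functions.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "neutron_sketch.db")
server = sqlite3.connect(db_path)
agent = sqlite3.connect(db_path)  # separate connection, like a separate process
server.execute("CREATE TABLE routers (id TEXT PRIMARY KEY)")
server.commit()


def agent_sees(router_id):
    """What the agent finds when it queries the DB after being notified."""
    row = agent.execute(
        "SELECT id FROM routers WHERE id = ?", (router_id,)
    ).fetchone()
    return row is not None


def create_router(router_id, notify_inside_transaction):
    """Create a router and return what the agent saw when it was notified."""
    with server:  # commits when the block exits
        server.execute("INSERT INTO routers VALUES (?)", (router_id,))
        if notify_inside_transaction:
            return agent_sees(router_id)  # row is not committed yet
    return agent_sees(router_id)  # row is committed


early = create_router("router-1", notify_inside_transaction=True)
late = create_router("router-2", notify_inside_transaction=False)
print("notified inside transaction, agent sees router:", early)
print("notified after commit, agent sees router:", late)
```

When the notification is sent from inside the transaction, the agent's query does not find the router; when it is sent after the commit, it does.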

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

This all seems very strange to me as well.

I did more testing today. I applied the patch with the refactor https://review.openstack.org/#/c/230481/ (in create_router, notifications are outside of the transaction). There were duplicates. I added a UniqueConstraint; after that, extra bindings with l3_agent_id=None appeared. I added a transaction in create_ha_port_and_bind in neutron/scheduler/l3_agent_scheduler.py. Extra bindings with l3_agent_id=None still appeared. I'll push these changes and think over the weekend about what else could be done.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

I've added transactions in places where agent_id is set in port_bindings: https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L295-L297 and https://github.com/openstack/neutron/blob/master/neutron/scheduler/l3_agent_scheduler.py#L330-L331. I've also added a unique constraint for l3_agent_id and router_id pair.

This helped: L3 HA bindings with duplicates or with l3_agent_id=None don't appear anymore.
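
The combination described here (setting the agent inside a transaction, plus a unique constraint on the router_id and l3_agent_id pair) can be sketched like this. It is not the merged change: sqlite3 stands in for the real backend, the table is simplified, and `bind_agent` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE bindings (
           port_id TEXT PRIMARY KEY,
           router_id TEXT NOT NULL,
           l3_agent_id TEXT,
           UNIQUE (router_id, l3_agent_id))"""
)
# Two bindings created without an agent, as the create_router flow does.
conn.executemany(
    "INSERT INTO bindings VALUES (?, 'router-1', NULL)",
    [("port-1",), ("port-2",)],
)
conn.commit()


def bind_agent(conn, router_id, agent_id):
    """Claim one unbound binding for agent_id; return its port_id or None."""
    try:
        with conn:  # lookup and update commit or roll back together
            row = conn.execute(
                "SELECT port_id FROM bindings "
                "WHERE router_id = ? AND l3_agent_id IS NULL "
                "ORDER BY port_id LIMIT 1",
                (router_id,),
            ).fetchone()
            if row is None:
                return None  # no free binding left for this router
            conn.execute(
                "UPDATE bindings SET l3_agent_id = ? WHERE port_id = ?",
                (agent_id, row[0]),
            )
            return row[0]
    except sqlite3.IntegrityError:
        return None  # this agent is already bound to the router


claimed = [
    bind_agent(conn, "router-1", agent)
    for agent in ("agent-a", "agent-a", "agent-b", "agent-c")
]
print(claimed)
```

The repeated request from agent-a is rejected by the unique constraint and rolled back, so the second binding stays free for agent-b, and agent-c finds nothing left to claim.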

Revision history for this message
Assaf Muller (amuller) wrote :

Also affects L3 HA fullstack tests. Getting the TRACE pasted here: http://paste.openstack.org/show/473760/.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/238122

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/238123

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Changing title to improve discoverability. This was duplicated already twice and Launchpad is a bit dumb.

summary: - L3 HA: extra L3HARouterAgentPortBinding created for routers
+ test_ha_router fails intermittently
Changed in neutron:
assignee: Ann Kamyshnikova (akamyshnikova) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Ann Kamyshnikova (akamyshnikova)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/227821
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3fef15a40b2714c1a372216ce60cc1384dc48c02
Submitter: Jenkins
Branch: master

commit 3fef15a40b2714c1a372216ce60cc1384dc48c02
Author: Ann Kamyshnikova <email address hidden>
Date: Fri Sep 25 15:30:30 2015 +0300

    Add transaction for setting agent_id in L3HARouterAgentPortBinding

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as
    None, operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents.

    Closes-Bug: #1499647

    Change-Id: Iaad82fe522cfd70061daecf411c924fdc11b7e41

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/238123
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0b8f9d0948cdb429c4b67ba138640ae515ffa1b2
Submitter: Jenkins
Branch: master

commit 0b8f9d0948cdb429c4b67ba138640ae515ffa1b2
Author: Ann Kamyshnikova <email address hidden>
Date: Wed Oct 21 17:37:34 2015 +0300

    Skip bindings with agent_id=None

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as None,
    operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents in change
    Iaad82fe522cfd70061daecf411c924fdc11b7e41

    In case if router was just created and l3 agent was not scheduled yet,
    so l3_agent_id is None, l3-agent-list-hosting-router <router> will fail.
    This change makes it work by skipping binding with agent_id=None.

    Partial-bug: #1499647

    Change-Id: I1aaf4b651f738febc26b0e1105aeabe066bca2a0
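
The skipping behavior this commit message describes can be illustrated with a small sketch. This is not the code from the change: `agents_for_router` is a hypothetical helper, and the bindings are plain dictionaries rather than database objects.

```python
def agents_for_router(bindings):
    """Map agent id -> ha_state, ignoring bindings that have no agent yet."""
    return {
        binding["l3_agent_id"]: binding["state"]
        for binding in bindings
        if binding["l3_agent_id"] is not None
    }


bindings = [
    {"l3_agent_id": "agent-a", "state": "active"},
    {"l3_agent_id": None, "state": "standby"},  # not scheduled yet
    {"l3_agent_id": "agent-b", "state": "standby"},
]
hosting = agents_for_router(bindings)
print(hosting)
```

Without the `is not None` filter, a lookup of agent details keyed by `None` is the kind of step that could fail for a router that has just been created.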

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/246248

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/246248
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bbbd0b95fb2c85d7089c8a4e3eb18b9372450ed9
Submitter: Jenkins
Branch: stable/liberty

commit bbbd0b95fb2c85d7089c8a4e3eb18b9372450ed9
Author: Ann Kamyshnikova <email address hidden>
Date: Fri Sep 25 15:30:30 2015 +0300

    Add transaction for setting agent_id in L3HARouterAgentPortBinding

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as
    None, operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents.

    Closes-Bug: #1499647

    Change-Id: Iaad82fe522cfd70061daecf411c924fdc11b7e41
    (cherry picked from commit 3fef15a40b2714c1a372216ce60cc1384dc48c02)

tags: added: in-stable-liberty
Assaf Muller (amuller)
Changed in neutron:
status: Fix Committed → In Progress
tags: added: fullstack
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/251931

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/251931
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=92735379fe1511139829003527777dec0396b211
Submitter: Jenkins
Branch: stable/liberty

commit 92735379fe1511139829003527777dec0396b211
Author: Ann Kamyshnikova <email address hidden>
Date: Wed Oct 21 17:37:34 2015 +0300

    Skip bindings with agent_id=None

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as None,
    operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents in change
    Iaad82fe522cfd70061daecf411c924fdc11b7e41

    In case if router was just created and l3 agent was not scheduled yet,
    so l3_agent_id is None, l3-agent-list-hosting-router <router> will fail.
    This change makes it work by skipping binding with agent_id=None.

    Partial-bug: #1499647

    Change-Id: I1aaf4b651f738febc26b0e1105aeabe066bca2a0
    (cherry picked from commit 0b8f9d0948cdb429c4b67ba138640ae515ffa1b2)

Revision history for this message
LIU Yulong (dragon889) wrote :

What about this exception:
"DBReferenceError: (IntegrityError) (1452, 'Cannot add or update a child row: a foreign key constraint fails (`neutron`.`ha_router_agent_port_bindings`, CONSTRAINT `ha_router_agent_port_bindings_ibfk_2` FOREIGN KEY (`router_id`) REFERENCES `routers` (`id`) ON DELETE CASCADE)') 'INSERT INTO ha_router_agent_port_bindings (port_id, router_id, l3_agent_id, state) VALUES (%s, %s, %s, %s)' ('f368c83f-40aa-45b4-89b5-d0ae6424ebb1', 'a1e4b69c-21a5-4bd4-b5e8-7e7346b29d85', None, 'standby')\n"],

It seems that the merged patch did not solve this.
There is a race between HA router delete and router update.
I reported it separately as https://bugs.launchpad.net/neutron/+bug/1522268.
The race will also cause the bug https://bugs.launchpad.net/neutron/+bug/1510757
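
The foreign key failure quoted above can be reproduced in miniature: once the router row is gone, inserting a binding that references it violates the constraint. The sketch below uses the standard-library sqlite3 module with a simplified schema (an assumption; the real tables live in MySQL), and `add_binding` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE routers (id TEXT PRIMARY KEY)")
conn.execute(
    """CREATE TABLE bindings (
           port_id TEXT PRIMARY KEY,
           router_id TEXT NOT NULL REFERENCES routers (id) ON DELETE CASCADE,
           l3_agent_id TEXT)"""
)
conn.execute("INSERT INTO routers VALUES ('router-1')")
conn.commit()


def add_binding(conn, port_id, router_id):
    """Insert a binding; return False if the router no longer exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO bindings VALUES (?, ?, NULL)", (port_id, router_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


before_delete = add_binding(conn, "port-1", "router-1")
with conn:
    # A concurrent router delete; ON DELETE CASCADE removes port-1's binding.
    conn.execute("DELETE FROM routers WHERE id = 'router-1'")
after_delete = add_binding(conn, "port-2", "router-1")
remaining = conn.execute("SELECT COUNT(*) FROM bindings").fetchone()[0]
print(before_delete, after_delete, remaining)
```

The insert that arrives after the delete fails with an integrity error, which matches the shape of the DBReferenceError in the log.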

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

@LIU Yulong (dragon889)

That error is not reported in this bug. I think that two of the existing patches, https://review.openstack.org/#/c/230481/ and https://review.openstack.org/#/c/238122/, will help to fix it.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.1

This issue was fixed in the openstack/neutron 7.0.1 release.

Revision history for this message
LIU Yulong (dragon889) wrote :

Thank you guys.
@Ann, the error is in the last paste, http://paste.openstack.org/show/473840/, in the "Bug Description". If this bug does not handle this DBReferenceError, maybe someone can remove Bug #1522268 from the "Duplicates of this bug".

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/255911

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/255911
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8aca427153485a8e3918497ed4e994c306dcb9fa
Submitter: Jenkins
Branch: stable/kilo

commit 8aca427153485a8e3918497ed4e994c306dcb9fa
Author: Ann Kamyshnikova <email address hidden>
Date: Wed Oct 21 17:37:34 2015 +0300

    Skip bindings with agent_id=None

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as None,
    operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents in change
    Iaad82fe522cfd70061daecf411c924fdc11b7e41

    In case if router was just created and l3 agent was not scheduled yet,
    so l3_agent_id is None, l3-agent-list-hosting-router <router> will fail.
    This change makes it work by skipping binding with agent_id=None.

    Partial-bug: #1499647

    Change-Id: I1aaf4b651f738febc26b0e1105aeabe066bca2a0
    (cherry picked from commit 0b8f9d0948cdb429c4b67ba138640ae515ffa1b2)

tags: added: in-stable-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/257857

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/257857
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cd978f8f806a32a5e6382a8d06ca1cc92917d73b
Submitter: Jenkins
Branch: stable/kilo

commit cd978f8f806a32a5e6382a8d06ca1cc92917d73b
Author: Ann Kamyshnikova <email address hidden>
Date: Fri Sep 25 15:30:30 2015 +0300

    Add transaction for setting agent_id in L3HARouterAgentPortBinding

    To avoid having extra L3HARouterAgentPortBinding with l3_agent as
    None, operation of setting l3_agent should be atomic.
    For this purpose, transaction was added in methods
    create_ha_port_and_bind and _bind_ha_router_to_agents.

    Closes-Bug: #1499647

    Conflicts:
     neutron/scheduler/l3_agent_scheduler.py

    Change-Id: Iaad82fe522cfd70061daecf411c924fdc11b7e41
    (cherry picked from commit 3fef15a40b2714c1a372216ce60cc1384dc48c02)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/238122
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=756864270aa87fadb4b2090f180577664cabc415
Submitter: Jenkins
Branch: master

commit 756864270aa87fadb4b2090f180577664cabc415
Author: Ann Kamyshnikova <email address hidden>
Date: Wed Oct 21 17:16:38 2015 +0300

    Add UniqueConstraint in L3HARouterAgentPortBinding

    It is expected that pair router_id and l3_agent_id will be unique
    in table ha_router_agent_port_bindings. As it appeared that
    duplicates can be added this change adds UniqueConstraint for
    this columns.

    Having duplicates is odd and leads to problems during sync_routers.

    DBReferenceError will be caught create_ha_port_and_bind and
    _bind_ha_router_to_agents(l3_agent_scheduler.py) as
    L3HARouterAgentPortBinding are created with l3_agent_id=None
    in _create_ha_port_binding (l3_hamode_db.py)

    Change-Id: I7ac2283752deaa3d9601b83859a46b9e89940269
    Partial-bug: #1499647

John Schwarz (jschwarz)
Changed in neutron:
assignee: Ann Kamyshnikova (akamyshnikova) → John Schwarz (jschwarz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/284400

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/285480

Changed in neutron:
assignee: John Schwarz (jschwarz) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/285572
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=046be0b8f30291cd029e6e97a4c6c5a1717a8bd1
Submitter: Jenkins
Branch: master

commit 046be0b8f30291cd029e6e97a4c6c5a1717a8bd1
Author: Kevin Benton <email address hidden>
Date: Wed Feb 24 13:30:24 2016 -0800

    Filter HA routers without HA interface and state

    This patch adjusts the sync method to exclude any HA
    routers from the response that are missing necessary
    HA fields (the HA interface and the HA state).

    This prevents the agent from every receiving a partially
    formed router.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>

    Related-Bug: #1499647
    Closes-Bug: #1533441
    Change-Id: Iadb5a69d4cbc2515fb112867c525676cadea002b
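
The filtering this commit message describes can be sketched as below. This is an illustration, not the merged code: routers are plain dictionaries, the key names `ha`, `_ha_interface` and `_ha_state` are assumptions made for the example, and `filter_incomplete_ha_routers` is a hypothetical helper.

```python
def filter_incomplete_ha_routers(routers):
    """Drop HA routers that lack an HA interface or an HA state."""
    kept = []
    for router in routers:
        is_ha = router.get("ha", False)
        complete = bool(router.get("_ha_interface")) and bool(router.get("_ha_state"))
        if is_ha and not complete:
            continue  # partially formed HA router: do not send to the agent
        kept.append(router)
    return kept


sync_response = [
    {"id": "r1", "ha": True, "_ha_interface": {"id": "port-1"}, "_ha_state": "standby"},
    {"id": "r2", "ha": True},  # created, but HA port not yet set up
    {"id": "r3", "ha": False},  # legacy router, unaffected
]
kept_ids = [router["id"] for router in filter_incomplete_ha_routers(sync_response)]
print(kept_ids)
```

Non-HA routers pass through untouched; only HA routers that are missing either field are held back.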

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/286065

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/286074

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/257059
Reason: Other patch should accomplish what this was doing for now. I added you as a Co-Author to the other Ann. Thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/286065
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e2ca10e7bbe287cc6f3af791e968d31369d77ab4
Submitter: Jenkins
Branch: stable/liberty

commit e2ca10e7bbe287cc6f3af791e968d31369d77ab4
Author: Kevin Benton <email address hidden>
Date: Wed Feb 24 13:30:24 2016 -0800

    Filter HA routers without HA interface and state

    This patch adjusts the sync method to exclude any HA
    routers from the response that are missing necessary
    HA fields (the HA interface and the HA state).

    This prevents the agent from every receiving a partially
    formed router.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>

    Related-Bug: #1499647
    Closes-Bug: #1533441
    Change-Id: Iadb5a69d4cbc2515fb112867c525676cadea002b
    (cherry picked from commit 046be0b8f30291cd029e6e97a4c6c5a1717a8bd1)

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

As far as I see this bug is fixed, I cannot reproduce it, can anyone clarify why it is in progress?

Revision history for this message
Assaf Muller (amuller) wrote :

I can still reproduce it, see https://review.openstack.org/#/c/284400/.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

We have https://bugs.launchpad.net/neutron/+bug/1550886 for that change, and in that case https://review.openstack.org/#/c/284400/ should at least have a Related-bug tag for this one.

Actually, the scenario that I originally used to get the errors (restarting L3 agents on all nodes during massive creation of routers) no longer reproduces the original problem. I get http://paste.openstack.org/show/489153/ instead, and it is mentioned in https://bugs.launchpad.net/neutron/+bug/1550886.

I'm not sure: does this bug track all problems related to test_ha_router failures?

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → John Schwarz (jschwarz)
Revision history for this message
Assaf Muller (amuller) wrote :

The bug title is 'test_ha_router fails intermittently', I'd consider this bug closed when the test passes reliably. Currently I'm only aware of the issue in https://bugs.launchpad.net/neutron/+bug/1550886, once that is solved, we can mark this as solved as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/293394

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/293394

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/286074
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9d924efe138fd8f18a3470134d2f1c04b925e88c
Submitter: Jenkins
Branch: stable/kilo

commit 9d924efe138fd8f18a3470134d2f1c04b925e88c
Author: Kevin Benton <email address hidden>
Date: Wed Feb 24 13:30:24 2016 -0800

    Filter HA routers without HA interface and state

    This patch adjusts the sync method to exclude any HA
    routers from the response that are missing necessary
    HA fields (the HA interface and the HA state).

    This prevents the agent from every receiving a partially
    formed router.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>

    Related-Bug: #1499647
    Closes-Bug: #1533441
    Change-Id: Iadb5a69d4cbc2515fb112867c525676cadea002b
    (cherry picked from commit 046be0b8f30291cd029e6e97a4c6c5a1717a8bd1)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/257059
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9c3c19f07ce52e139d431aec54341c38a183f0b7
Submitter: Jenkins
Branch: master

commit 9c3c19f07ce52e139d431aec54341c38a183f0b7
Author: Kevin Benton <email address hidden>
Date: Thu Feb 18 03:48:29 2016 -0800

    Add ALLOCATING state to routers

    This patch adds a new ALLOCATING status to routers
    to indicate that the routers are still being built on the
    Neutron server. Any routers in this state are excluded in
    router retrievals by the L3 agent since they are not yet
    ready to be wired up.

    This is necessary when a router is made up of several
    distinct Neutron resources that cannot all be put
    into a single transaction. This patch applies this new
    state to HA routers while their internal HA ports and
    networks are being created/deleted so the L3 HA agent
    will never retrieve a partially formed HA router. It's
    important to note that the ALLOCATING status carries over
    until after the scheduling is done, which ensures that
    routers that weren't fully scheduled will not be sent to
    the agents.

    An HA router is placed in this state only when it is being
    created or converted to/from the HA state since this is
    disruptive to the dataplane.

    This patch also reverts the changes introduced in
    Iadb5a69d4cbc2515fb112867c525676cadea002b since they will
    be handled by the ALLOCATING logic instead.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>

    APIImpact
    Closes-Bug: #1550886
    Related-bug: #1499647
    Change-Id: I22ff5a5a74527366da8f82982232d4e70e455570
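
The status-based approach in this commit message can be sketched with a toy model. It is not the Neutron implementation: the in-memory `routers` dictionary stands in for the database, and `create_ha_router`, `sync_routers` and the two build steps are simplified hypothetical functions. Only the idea is kept: a router stays in ALLOCATING until every step has finished, and the sync skips routers in that state.

```python
ALLOCATING = "ALLOCATING"
ACTIVE = "ACTIVE"

routers = {}                # stands in for the routers table
visible_during_build = []   # what a sync would return at each build step


def sync_routers():
    """Return the ids of routers an L3 agent is allowed to retrieve."""
    return [r["id"] for r in routers.values() if r["status"] != ALLOCATING]


def create_ha_network(router):
    router["ha_network"] = "ha-net"
    visible_during_build.append(sync_routers())  # simulate a racing sync


def schedule_router(router):
    router["agents"] = ["agent-a", "agent-b"]
    visible_during_build.append(sync_routers())  # simulate a racing sync


def create_ha_router(router_id):
    """Build the router in several non-atomic steps, hidden until complete."""
    router = {"id": router_id, "status": ALLOCATING}
    routers[router_id] = router
    for step in (create_ha_network, schedule_router):
        step(router)
    router["status"] = ACTIVE  # only now does the router become retrievable


create_ha_router("router-1")
print(visible_during_build, sync_routers())
```

Every sync that runs while the router is being built sees nothing, and the router shows up only after the status flips.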

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/305622

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/305774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/301316
Reason: https://review.openstack.org/#/c/257059/ resolves same issue and already got merged.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/305622
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=36305c0c4f4ebf498020f5956e103832da75f8a9
Submitter: Jenkins
Branch: stable/mitaka

commit 36305c0c4f4ebf498020f5956e103832da75f8a9
Author: Kevin Benton <email address hidden>
Date: Thu Feb 18 03:48:29 2016 -0800

    Add ALLOCATING state to routers

    This patch adds a new ALLOCATING status to routers
    to indicate that the routers are still being built on the
    Neutron server. Any routers in this state are excluded in
    router retrievals by the L3 agent since they are not yet
    ready to be wired up.

    This is necessary when a router is made up of several
    distinct Neutron resources that cannot all be put
    into a single transaction. This patch applies this new
    state to HA routers while their internal HA ports and
    networks are being created/deleted so the L3 HA agent
    will never retrieve a partially formed HA router. It's
    important to note that the ALLOCATING status carries over
    until after the scheduling is done, which ensures that
    routers that weren't fully scheduled will not be sent to
    the agents.

    An HA router is placed in this state only when it is being
    created or converted to/from the HA state since this is
    disruptive to the dataplane.

    This patch also reverts the changes introduced in
    Iadb5a69d4cbc2515fb112867c525676cadea002b since they will
    be handled by the ALLOCATING logic instead.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>

    APIImpact
    Closes-Bug: #1550886
    Related-bug: #1499647
    Change-Id: I22ff5a5a74527366da8f82982232d4e70e455570
    (cherry picked from commit 9c3c19f07ce52e139d431aec54341c38a183f0b7)
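
To illustrate the idea in the commit message above, here is a minimal sketch of how an ALLOCATING status could hide a half-built HA router from agent queries. This is not Neutron's actual code: the dict-based `store`, the function names and the `create_ha_resources`/`schedule` callbacks are all hypothetical stand-ins for the real DB models, HA port/network creation and the scheduler.

```python
# Hypothetical status names and router records; not Neutron's real models.
ALLOCATING = "ALLOCATING"
ACTIVE = "ACTIVE"


def routers_ready_for_agent(routers):
    """Return only the routers that the server has finished building."""
    return [r for r in routers if r["status"] != ALLOCATING]


def create_ha_router(store, router_id, create_ha_resources, schedule):
    """Build an HA router in several steps, hiding it until all succeed."""
    router = {"id": router_id, "status": ALLOCATING, "ha": True}
    store[router_id] = router
    # These steps cannot share a single transaction, so the router stays
    # in ALLOCATING (and invisible to agents) while they run.
    create_ha_resources(router)
    schedule(router)
    # The status is flipped only after scheduling is done.
    router["status"] = ACTIVE
    return router
```

If either callback raises, the router is left in ALLOCATING, so an agent that syncs in the meantime would never see the partially formed router.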

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/314250

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/285480
Reason: This patch was squashed into https://review.openstack.org/#/c/317949/.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/314250
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3bf73801df169de40d365e6240e045266392ca63
Submitter: Jenkins
Branch: master

commit a323769143001d67fd1b3b4ba294e59accd09e0e
Author: Ryan Moats <email address hidden>
Date: Tue Oct 20 15:51:37 2015 +0000

    Revert "Improve performance of ensure_namespace"

    This reverts commit 81823e86328e62850a89aef9f0b609bfc0a6dacd.

    Unneeded optimization: this commit only improves execution
    time on the order of milliseconds, which is less than 1% of
    the total router update execution time at the network node.

    This also

    Closes-bug: #1574881

    Change-Id: Icbcdf4725ba7d2e743bb6761c9799ae436bd953b

commit 7fcf0253246832300f13b0aa4cea397215700572
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Apr 21 07:05:16 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I9e930750dde85a9beb0b6f85eeea8a0962d3e020

commit 643b4431606421b09d05eb0ccde130adbf88df64
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Apr 19 06:52:48 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I52d7460b3265b5460b9089e1cc58624640dc7230

commit 1ffea42ccdc14b7a6162c1895bd8f2aae48d5dae
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Apr 18 15:03:30 2016 +0000

    Updated from global requirements

    Change-Id: Icb27945b3f222af1d9ab2b62bf2169d82b6ae26c

commit b970ed5bdac60c0fa227f2fddaa9b842ba4f51a7
Author: Kevin Benton <email address hidden>
Date: Fri Apr 8 17:52:14 2016 -0700

    Clear DVR MAC on last agent deletion from host

    Once all agents are deleted from a host, the DVR MAC generated
    for that host should be deleted as well to prevent a buildup of
    pointless flows generated in the OVS agent for hosts that don't
    exist.

    Closes-Bug: #1568206
    Change-Id: I51e736aa0431980a595ecf810f148ca62d990d20
    (cherry picked from commit 92527c2de2afaf4862fddc101143e4d02858924d)
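
The cleanup rule described in this commit can be sketched roughly as follows. The two dictionaries and the `delete_agent` helper are assumptions made for illustration only; the real change operates on Neutron's agent and DVR MAC database tables.

```python
def delete_agent(agents_by_host, dvr_mac_by_host, host, agent_id):
    """Remove an agent; drop the host's DVR MAC if it was the last one."""
    agents = agents_by_host.get(host, set())
    agents.discard(agent_id)
    if not agents:
        # No agents are left on this host, so its DVR MAC is stale and
        # would only produce pointless flows in the OVS agents.
        agents_by_host.pop(host, None)
        dvr_mac_by_host.pop(host, None)
```

The MAC is kept as long as at least one agent remains registered on the host, and removed together with the last one.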

commit eee9e58ed258a48c69effef121f55fdaa5b68bd6
Author: Mike Bayer <email address hidden>
Date: Tue Feb 9 13:10:57 2016 -0500

    Add an option for WSGI pool size

    Neutron currently hardcodes the number of
    greenlets used to process requests in a process to 1000.
    As detailed in
    http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html

    this can cause requests to wait within one process
    for available database connection while other processes
    remain available.

    By adding a wsgi_default_pool_size option functionally
    identical to that of Nova, we can lower the number of
    greenlets per process to be more in line with a typical
    max database connection pool size.

    DocImpact: a previously unused configuration value
               wsgi_default_pool_size is now used to a...

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

What issue is this bug supposed to fix now? I already asked this question, but... I think we need to clean up all the open bugs and understand all the known issues.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/liberty)

Change abandoned by John Schwarz (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/305774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/323232

Revision history for this message
John Schwarz (jschwarz) wrote :

As per comment #39, this can be closed - this bug report is mostly a tracker bug, and I'm under the impression that most of the races that made test_ha_router fail are resolved.

Some other races are https://bugs.launchpad.net/neutron/+bug/1605285 and https://bugs.launchpad.net/neutron/+bug/1605282, but these can be addressed separately.

Changed in neutron:
status: In Progress → Fix Released
