some L3 HA routers do not work

Bug #1757188 reported by Piotr Misiak
This bug affects 2 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Pike
DVR + L3_HA
L2population enabled
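
For context, the setup corresponds roughly to the following options (a hedged sketch; file paths and exact values are assumptions about a typical Pike deployment, not copied from this environment):

/etc/neutron/neutron.conf:
[DEFAULT]
router_distributed = true
l3_ha = true

/etc/neutron/plugins/ml2/ml2_conf.ini:
[ml2]
mechanism_drivers = openvswitch,l2population

/etc/neutron/plugins/ml2/openvswitch_agent.ini:
[agent]
l2_population = true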

Some of our L3 HA routers are not working correctly: they are not reachable from instances.
After a deep investigation, I've found that the "HA port tenant <tenant_id>" ports are in state DOWN.
They are DOWN because they have no binding information.
They have no binding information because the "HA network tenant <tenant_id>" network is corrupted.

By "corrupted" I mean that the network does not have the provider:network_type and provider:segmentation_id attributes set.

The weird thing is that this network was OK and working, but at some point in time it became corrupted. I don't have any logs from that point in time.
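
For anyone reproducing the check, something like the following should show the same symptoms (a hedged sketch; the IDs are placeholders and the exact field names shown by the client may differ slightly between releases):

openstack port list --network <ha_network_id>          # the "HA port tenant ..." ports show status DOWN
openstack port show <ha_port_id> -c binding_vif_type   # unbound / binding_failed instead of ovs
openstack network show <ha_network_id>                 # provider:network_type and provider:segmentation_id are empty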

For comparison working HA tenant network:

+---------------------------+----------------------------------------------------+
| Field | Value |
+---------------------------+----------------------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | nova |
| created_at | 2018-02-16T16:52:31Z |
| description | |
| id | fa2fea5c-ccaa-4116-bb0c-ff59bbd8229a |
| ipv4_address_scope | |
| ipv6_address_scope | |
| mtu | 9000 |
| name | HA network tenant afeeb372d7934795b63868330eca0dfe |
| port_security_enabled | True |
| project_id | |
| provider:network_type | vxlan |
| provider:physical_network | |
| provider:segmentation_id | 35 |
| revision_number | 3 |
| router:external | False |
| shared | False |
| status | ACTIVE |
| subnets | 5cbc612d-13cf-4889-88fb-02d1debe5f8d |
| tags | |
| tenant_id | |
| updated_at | 2018-02-16T16:52:31Z |
+---------------------------+----------------------------------------------------+

and not working HA tenant network:

+---------------------------+----------------------------------------------------+
| Field | Value |
+---------------------------+----------------------------------------------------+
| admin_state_up | True |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2018-01-26T12:24:15Z |
| description | |
| id | 6390c381-871e-4945-bfa0-00828bb519bc |
| ipv4_address_scope | |
| ipv6_address_scope | |
| mtu | 9000 |
| name | HA network tenant 3e88cffb9dbb4e1fba96ee72a02e012e |
| port_security_enabled | True |
| project_id | |
| provider:network_type | |
| provider:physical_network | |
| provider:segmentation_id | |
| revision_number | 5 |
| router:external | False |
| shared | False |
| status | ACTIVE |
| subnets | 4d579b00-c780-45ed-9bd8-4d3256fa8a42 |
| tags | |
| tenant_id | |
| updated_at | 2018-01-29T14:08:11Z |
+---------------------------+----------------------------------------------------+

I've found that all working networks have revision_number = 3 and all non-working networks have revision_number = 5.

When the "HA network tenant" network is corrupted, ALL L3-HA routers in that tenant stop working.
Is there any way to fix this without removing all existing L3-HA routers in the tenant?

Unfortunately I can't find any code responsible for updating or modifying the "HA network tenant" network, so I've hit a wall in my debugging.
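
One way to confirm where the data is missing would be to look at the ML2 segments table directly (a hedged sketch; table and column names are from my understanding of the Pike schema, and the DB credentials are placeholders):

mysql -u neutron -p neutron -e "SELECT network_id, network_type, physical_network, segmentation_id FROM networksegments WHERE network_id = '6390c381-871e-4945-bfa0-00828bb519bc';"

If the provider attributes are really gone, this should return no row for the corrupted network, while the working network (fa2fea5c-ccaa-4116-bb0c-ff59bbd8229a) should show its vxlan segment with segmentation_id 35.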

The network was probably corrupted during automatic provisioning of network resources by a Heat stack, but I can't reproduce this.

Tags: l3-ha
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The HA network is created in [1]. I wonder if deleting the HA network or the HA port and restarting the master/standby agents will trigger a rebuild. It's been a while since I looked at this part of the code, and I am not 100% sure what's in there anymore.

[1] http://git.openstack.org/cgit/openstack/neutron/tree/neutron/db/l3_hamode_db.py#n197
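
If deleting and recreating does trigger a rebuild, the sequence would be roughly the following (a hedged, unverified sketch; the network ID is the one from this report, the port ID is a placeholder, and the agent service name depends on the distro):

openstack port delete <ha_port_id>                               # the DOWN "HA port tenant ..." ports
openstack network delete 6390c381-871e-4945-bfa0-00828bb519bc    # the corrupted HA network
systemctl restart neutron-l3-agent                               # on the nodes hosting the master/standby instances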

tags: added: l3-ha
Revision history for this message
Piotr Misiak (piotr-misiak) wrote :

I've found that Heat was performing a stack rollback around the time the HA network got corrupted. I will debug this further.

summary: - some L3 HA routrers does not work
+ some L3 HA routers do not work
Revision history for this message
Piotr Misiak (piotr-misiak) wrote :

Guys, I've found the root cause and also the commit which fixes it.

This bug is the same as this one: https://bugs.launchpad.net/neutron/+bug/1757188

Thanks

Revision history for this message
Piotr Misiak (piotr-misiak) wrote :

I pasted the wrong URL. This is correct: https://bugs.launchpad.net/neutron/+bug/1732543

Changed in neutron:
status: New → Invalid
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Glad you got it sorted!
