Disabling an ML2 type driver can leave orphaned DB records

Bug #1311470 reported by Kevin Benton on 2014-04-23
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Medium
Unassigned

Bug Description

If an ML2 type driver is disabled after segments have been allocated using that type driver, subsequent network deletions will not remove the DB records allocated by that type driver since the driver isn't there to release the segment[1]. These orphaned segments will then be unavailable for use if the type driver is re-enabled later.

1. https://github.com/openstack/neutron/blob/af89d74d2961db6a04572375150ad908c9e72e78/neutron/plugins/ml2/managers.py#L103

Tags: ml2 Edit Tag help
tags: added: ml2
Changed in neutron:
importance: Undecided → Low
status: New → Confirmed
importance: Low → Medium
navin (navinchandra-salunke) wrote :

I was just going through this bug.

On Openstack website, in ML2 description, Following warning has been given:

"Disabling a ML2 type driver and re-enabling it later may lead to database inconsistencies if ML2 is reconfigured without support for that type. "

see:
http://docs.openstack.org/admin-guide-cloud/content/ml2_scenarios.html

We can solve this with some work around. But before jump to find out solution I want to know your view.

Changed in neutron:
status: Confirmed → Incomplete
Kevin Benton (kevinbenton) wrote :

This shouldn't have been set to Incomplete. There is no missing information here.

This bug was created for the ML2 team to follow-up with the REVISIT line in the referenced code. I updated it to 'confirmed' and then if nobody wants to put a fix directly in ML2 it can be updated to Won't Fix.

Changed in neutron:
status: Incomplete → Confirmed
Changed in neutron:
assignee: nobody → navin (navinchandra-salunke)
navin (navinchandra-salunke) wrote :

I have gone through the code and found following approaches to fix this:
1) Irrespective of what drivers are enabled we can have all the drivers in the driver list. By this way release (delete) function will be called accordingly ( by respective driver).

2) At database side, we need to alter the table with cascade delete option. (Still looks like complicated to me)

3) Writing "AFTER DELETE " Trigger on ml2_network_allocations table of neutron_ml2 database such a way that respective entry from ml2_gre_allocations and ml2_vlan_allocations... will get deleted.

Currently I'm locally trying 3rd approach. Please let me know your suggestions.

navin (navinchandra-salunke) wrote :

I have written trigger on table: ml2_network_allocations. This trigger is getting activated as soon as we delete any row from ml2_network_allocations table. it resets the value of "allocated" field of table ml2_gre_allocation ( I am trying with gre first). So trigger has been written correctly.
But it is not getting activate when I am trying to delete through "neutron net-delete <netid>" API.
I am also debugging the python code to make sure query fired by API should be proper. But no luck yet.
Please let me know if you have any pointers.

Database: neutron_ml2

navin (navinchandra-salunke) wrote :

Trigger was not getting fired because record was getting deleted due to cascade delete operation. Currently, cascaded foreign key actions do not activate triggers in MySQL InnoDB . It is not supported by MySQL.
To work around for this, I have use Cascade trigger concept. (ie chain of triggers).
We have Parent table and child table with cascade delete implemented so that respective entry get deleted from the child.
as per this work around, We need to remove this cascade delete constraint and need to add trigger which do exactly same thing. We also need to write one more trigger on child table which will update the segment id of respective type driver table. This way, Whenever master record get deleted the child entry will be deleted because of the first trigger. This delete operation will activate trigger written on Child table and update the respective "type driver table" by resetting value to 0.

I have successfully implemented and tested this approach locally.

References:

Database : neutron_ml2
Parent table : networks
Child table : ml2_network_segments
type driver tables : ml2_vlan_allocations, ml2_vxlan_allocations, ml2_gre_allocations (depends on
                                                                                        network type).

Let me know your suggestion. If it seems fine then I can proceed to fix this on repository for review.

I would push the fix up to Gerrit so people can see the code and begin a
discussion there.
On Jun 19, 2014 8:05 AM, "navin" <email address hidden>
wrote:

> Trigger was not getting fired because record was getting deleted due to
> cascade delete operation. Currently, cascaded foreign key actions do not
> activate triggers in MySQL InnoDB . It is not supported by MySQL.
> To work around for this, I have use Cascade trigger concept. (ie chain of
> triggers).
> We have Parent table and child table with cascade delete implemented so
> that respective entry get deleted from the child.
> as per this work around, We need to remove this cascade delete constraint
> and need to add trigger which do exactly same thing. We also need to write
> one more trigger on child table which will update the segment id of
> respective type driver table. This way, Whenever master record get deleted
> the child entry will be deleted because of the first trigger. This delete
> operation will activate trigger written on Child table and update the
> respective "type driver table" by resetting value to 0.
>
> I have successfully implemented and tested this approach locally.
>
> References:
>
> Database
> : neutron_ml2
> Parent table :
> networks
> Child table
> : ml2_network_segments
> type driver tables :
> ml2_vlan_allocations, ml2_vxlan_allocations, ml2_gre_allocations (depends on
>
> network type).
>
> Let me know your suggestion. If it seems fine then I can proceed to fix
> this on repository for review.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1311470
>
> Title:
> Disabling an ML2 type driver can leave orphaned DB records
>
> Status in OpenStack Neutron (virtual network service):
> Confirmed
>
> Bug description:
> If an ML2 type driver is disabled after segments have been allocated
> using that type driver, subsequent network deletions will not remove
> the DB records allocated by that type driver since the driver isn't
> there to release the segment[1]. These orphaned segments will then be
> unavailable for use if the type driver is re-enabled later.
>
> 1.
>
> https://github.com/openstack/neutron/blob/af89d74d2961db6a04572375150ad908c9e72e78/neutron/plugins/ml2/managers.py#L103
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/neutron/+bug/1311470/+subscriptions
>

navin (navinchandra-salunke) wrote :

Following is the sql script which has solved the above problem. I could not fix this because I am not getting the source for SQL script :

SQL Script:
========

ALTER TABLE `ml2_network_segments` DROP FOREIGN KEY `ml2_network_segments_ibfk_1`;

DROP TRIGGER IF EXISTS trig_deallocate_network_segment;
DELIMITER $$
CREATE DEFINER = `root`@`devstack` TRIGGER `trig_deallocate_network_segment` AFTER DELETE ON `networks` FOR EACH ROW
         BEGIN
             DELETE FROM neutron_ml2.ml2_network_segments WHERE neutron_ml2.ml2_network_segments.network_id=old.id;
         END$$
DELIMITER ;

DROP TRIGGER IF EXISTS trig_deallocate_type_segment;
DELIMITER $$
CREATE DEFINER = `root`@`devstack` TRIGGER `trig_deallocate_type_segment` AFTER DELETE ON `ml2_network_segments` FOR EACH ROW
BEGIN
    IF OLD.network_type = 'gre' THEN
        UPDATE neutron_ml2.ml2_gre_allocations SET neutron_ml2.ml2_gre_allocations.allocated=0
                      WHERE OLD.segmentation_id=neutron_ml2.ml2_gre_allocations.gre_id;
    ELSEIF OLD.network_type = 'vlan' THEN
        UPDATE neutron_ml2.ml2_gre_allocations SET neutron_ml2.ml2_vlan_allocations.allocated=0
                      WHERE OLD.segmentation_id=neutron_ml2.ml2_vlan_allocations.vlan_id;
    ELSEIF OLD.network_type = 'vxlan' THEN
        UPDATE neutron_ml2.ml2_vxlan_allocations SET neutron_ml2.ml2_vxlan_allocations.allocated=0
                     WHERE OLD.segmentation_id=neutron_ml2.ml2_vxlan_allocations.vxlan_vni;
    END IF;
END$$
DELIMITER ;
------------------------------------------------------------------------------------------------------------------------------------------------------------

Robert Kukura (rkukura) wrote :

I'm wondering if it would be sufficient just to somehow warn operators to not unconfigure a type driver while network segments of that type exist. Maybe documenting a DB query that can be used to identify the networks using that type would be helpful, so that those networks can be deleted first.

Kevin Benton (kevinbenton) wrote :

Do you think it would be okay to add a DB scan on the initialization of
neutron that checks for unsupported types and causes neutron to quit with a
warning to get rid of them?

Either that or we could automatically re-allocate segments for them from a
supported type. (I like this idea the more)

The current state is kinda painful. We have run into this a couple of times
using deployment tools that automatically created a couple of networks
before we adjusted the type drivers. We then just resort to a monster sql
query that rips out the segments, ports, networks, etc associated with the
old type.
On Aug 13, 2014 8:31 AM, "Robert Kukura" <email address hidden> wrote:

> I'm wondering if it would be sufficient just to somehow warn operators
> to not unconfigure a type driver while network segments of that type
> exist. Maybe documenting a DB query that can be used to identify the
> networks using that type would be helpful, so that those networks can be
> deleted first.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1311470
>
> Title:
> Disabling an ML2 type driver can leave orphaned DB records
>
> Status in OpenStack Neutron (virtual network service):
> Confirmed
>
> Bug description:
> If an ML2 type driver is disabled after segments have been allocated
> using that type driver, subsequent network deletions will not remove
> the DB records allocated by that type driver since the driver isn't
> there to release the segment[1]. These orphaned segments will then be
> unavailable for use if the type driver is re-enabled later.
>
> 1.
>
> https://github.com/openstack/neutron/blob/af89d74d2961db6a04572375150ad908c9e72e78/neutron/plugins/ml2/managers.py#L103
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/neutron/+bug/1311470/+subscriptions
>

Sukhdev Kapur (sukhdev-8) wrote :

This was discussed in ML2 meeting and conclusion was reached that we check the DB for segments for without configured drivers. If one is found, we will fail the initialization with appropriate warning as well as steps for the operator to remove those segments from the DB.

Kevin Benton (kevinbenton) wrote :

Is there a reason that re-allocating a segment from a configured type driver was considered bad? I can implement this feature if there isn't some fundamental limitation with it.

The SQL to clean up the issue is pretty ugly and it gets rid of all of the existing networks. An auto-reallocation could be a good first step to actually providing users a way to transition between different underlying types.

Edgar Magana (emagana) wrote :

On the Docs side. This has been already fixed by: https://bugs.launchpad.net/openstack-manuals/+bug/1307133
Nothing else is needed.

Shiv Haris (shh) on 2014-10-15
Changed in neutron:
milestone: none → kilo-1
Kyle Mestery (mestery) on 2014-12-16
Changed in neutron:
milestone: kilo-1 → kilo-2
Kyle Mestery (mestery) on 2015-02-03
Changed in neutron:
milestone: kilo-2 → kilo-3
Kevin Benton (kevinbenton) wrote :

Hi Navin,

Are you still intending on working on a solution for this? If not, please un-assign the bug from yourself so we can see that nobody is actively working on it.

navin (navinchandra-salunke) wrote :

Hi Kevin,
Sorry for keep it assigning for long time. I have changed it to nobody.

Changed in neutron:
assignee: navin (navinchandra-salunke) → nobody
Kevin Benton (kevinbenton) wrote :

No problem.

Kyle Mestery (mestery) on 2015-03-19
Changed in neutron:
milestone: kilo-3 → kilo-rc1
Neil Jerram (neil-jerram) wrote :

I have reviewed this bug and could work it; but there doesn't seem to be any consensus yet on what the correct fix should be.

- Navin suggested (IIUC) that a type driver could still be used to delete segments after it was disabled. I.e., I guess, that 'disabled' effectively means disabled only for creating new segments.

- Kevin has suggested an audit that takes place at Neutron initialization time, and perhaps tries to reassign affected segments to another type driver.

- My own thought on first reading was that affected segments should be deleted when the type driver that created them was disabled. But no one else has suggested that, so perhaps there's a reason why it doesn't make sense.

Something that I'm not clear about - but I guess is important - is precisely how a type driver can be disabled. One way, of course, is to stop and restart the Neutron server with modified type_drivers line in ml2_conf.ini. Is that in fact the only way? If so I guess that would explain Kevin's suggestion of an initialization time fix, and also why it doesn't make sense to do something 'when' a driver is disabled.

Please could interested parties chip in here about how they think this should be fixed?

Kevin Benton (kevinbenton) wrote :

Your understanding of how they can be disabled is correct. There is no API call or anything like that to do it. The only way is via the config file change and restarting the server.

no code, so don't think this is sensible as an RC1 target, please post code and we'll reassess.

Changed in neutron:
milestone: kilo-rc1 → none
Neil Jerram (neil-jerram) wrote :

Thanks, Kevin.

Armando - I took a look at this bug because it was listed at https://launchpad.net/neutron/+milestone/kilo-rc1 but had no one assigned to it. I assumed that meant it was considered important, but if not I'm happy to drop it and look elsewhere for useful work.

Sreekumar S (sreesiv) on 2015-10-30
Changed in neutron:
assignee: nobody → Sreekumar S (sreesiv)
Sreekumar S (sreesiv) wrote :

I am analyzing this issue now and would like to provide a solution other than the database trigger/database script fix for the admin.
I was a bit confused with the description provided for this and so will post the steps I used to recreate this.

I am using a devstack env and by its configuration, it configures a neutron nw with
tenant_network_types = gre &
type_drivers = local,flat,vlan,gre,vxlan

- devstack created a private network for me with
Network Type: gre
Physical Network: -
Segmentation ID: 93

- I changed /etc/neutron/plugins/ml2/ml2_conf.ini to
tenant_network_types = vlan
type_drivers = local,flat,vlan

- restarted all q-* services manually

- via Horizon I deleted the private network of type 'gre'. In this case no gre type driver was loaded. So segment deletion hasn't happened properly

- Now I create a private network with vlan type
`neutron net-create --provider:network_type=vlan --provider:segmentation_id=93 --provider:physical_network=physnet1 private`
I kept the ID to 93, just so as to confirm there is no cross dependency with IDs among types!!!

- Above step goes fine and network is created with type vlan

- I changed /etc/neutron/plugins/ml2/ml2_conf.ini to
tenant_network_types = gre
type_drivers = local,flat,gre

- restarted all q-* services manually

- via Horizon I deleted the private network of type 'vlan'. In this case no vlan type driver was loaded. So segment deletion hasn't happened properly

- Now I create a private network with gre type "re-enabled"
`neutron net-create --provider:network_type=gre --provider:segmentation_id=93 private`

- The command above fails with message...
"Unable to create the network. The tunnel ID 93 is in use."

- I changed /etc/neutron/plugins/ml2/ml2_conf.ini to
tenant_network_types = gre,vlan
type_drivers = local,flat,gre,vlan

- restarted all q-* services manually

- Now I create a private network with vlan type "re-enabled"
`neutron net-create --provider:network_type=vlan --provider:segmentation_id=93 --provider:physical_network=physnet1 private`

- The command above fails with message...
"Unable to create the network. The VLAN 93 on physical network physnet1 is in use."

- Re-check gre type creation once more
`neutron net-create --provider:network_type=gre --provider:segmentation_id=93 private`
That also fails with message "Unable to create the network. The tunnel ID 93 is in use."

I guess this is the issue that is described with this bug, Can anyone please confirm?

Sreekumar S (sreesiv) wrote :

Somehow Gerrit is not posting any comments for the my submitted changes...

Please review this patch.
Patch for 'master' branch: https://review.openstack.org/245137

Changed in neutron:
status: Confirmed → In Progress

I wonder if change [1] (to be expanded) can come to the rescue after all. We could associate the network id to the segment tables, and leverage a foreign key constraint with delete on cascade that will wipe the segments automatically on network deletion. This is basically what is commented in [2].

That said, I really think that we're over-engineering a solution to a problem that hardly occurs in practice.

[1] https://review.openstack.org/#/c/255614/
[2] https://github.com/openstack/neutron/blob/af89d74d2961db6a04572375150ad908c9e72e78/neutron/plugins/ml2/managers.py#L105

Sreekumar S (sreesiv) wrote :

Here's my opinion on this... :-)

I understand your point on expanding [1] so that we could have something to hold onto for doing [2].
But when I started with this bug, my intention was the exact opposite and I've mentioned that in the comments & commit msg.

https://bugs.launchpad.net/neutron/+bug/1311470/comments/20

"""In this proposed solution, there are no foreign key based database
triggers/database scripts for the admin. Currently the segments are
released by the type driver loaded specific to that type. So this
change intends to use the type driver to release the segment."""

The reason for me not liking a db fix is explained below...

Note: These are just my thoughts, and please consider that I'm the least experienced in the group :-)

1) I don't see 'data in the DB' as the only potential dependency associated with a type driver. Imagine a type driver (may be from a vendor) which has to externally setup something (may be a device, or anything else which only it knows how to do) on creation, updation and also deletion. So those operations become very much type driver specific. IMHO the driver itself should delete the segment and that it the correct way.

2) Now when we've given an option for the admin to add/remove drivers using "type_drivers", what we've created is a monster not a mosquito! :-) It gives him all the power to configure that option anyway he wants, he can even decide to take drivers away for networks that are 'in operation', for whatever reasons (obsoleting, driver issues etc). This is the root cause of the issue.
Now to make him understand not to do that, and educate him to how to achieve the same with policies and all, is another point. But the reality is that there is an option for him to achieve that and it's quite easy to do

3) Rarity of occurrence may be a fact here. But IMO, when something like this happens it will be a big pain (https://bugs.launchpad.net/neutron/+bug/1311470/comments/9). And when it happens to a vendor who has something to do like what is mentioned (1) it will be worse.
If that is rare or not, I think that debate will never end :)

Even if that happens, my question is what's wrong with providing an option which is not otherwise used, and is empty by default. To be honest I really don't think this as over-engineering. Yeah my initial plan to detect the orphan types was a flawed over-engineering effort.

To summarise:
The fact that there is 'type_drivers' and it has the potential to create a mess as in (1) and (2) (although deemed rare) justify the provision of 'disabled_type_drivers' in case such a problem happens. And it's a simple implementation!

[1] https://review.openstack.org/#/c/255614/
[2]shttps://github.com/openstack/neutron/blob/af89d74d2961db6a04572375150ad908c9e72e78/neutron/plugins/ml2/managers.py#L105

Sreekumar S (sreesiv) wrote :

I had some more thoughts on this...

I now feel, there should be only 'disabled_type_drivers' and the 'type_drivers' option should be deprecated and taken out in future.
The full list of type drivers enabled for neutron can be a hard coded list. Vendors/anyone can add their driver entry to this list when they submit their drivers to the main tree.

The operator can configure 'disabled_type_drivers' to include the drivers/types he wants to disable. By default this list will be empty. Drivers which are in the main list (hard coded) and which are not there in the 'disabled_type_drivers' option will be loaded, registered, and used for creation/updation. When someone deletes a network with a disabled driver, it will get deleted with the driver from the 'disabled_type_drivers' list.

There will be no confusion this way, and only one list for the admin to think of.

Looking for inputs/suggestions.

Configuration options is a sore point in OpenStack in general and I personally do not think we should promote an uncontrolled sprawl of static config options, especially to solve this class of problems. As a matter of fact, if the desire is to make type driver a dynamic property of Neutron deployments, then this should be driven via an admin API, rather than a config option which is supposed to be empty only until the state of the deployment change: this is very awkward to say the least.

To your point about DB triggers not being enough, there's a plethora of integration mechanisms to let 3rd parties know about state changes; direct method calls is a tightly coupled mechanism that should be used as last resort. As for the specific issue with type drivers, I believe that if an admin wants to decommission type drivers, he/she should not do so without contemplating the state of his/her cloud. If there are segments in use, then he/she must understand the consequences. As for the situation reported in this bug, the consequences can be dealt with with a simple DB cleanup, because the type driver interface is so trivial that nothing esoteric can happen during the segment release operation, even if a vendor type driver wanted to.

Btw, if you were to figure out a way not to expose the config variable and yet manage to provide a mechanism for releasing the segments, I might be fine with that ;)

Sreekumar S (sreesiv) wrote :

On that last note, I am trying another amendment PS #14 :-)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/245137
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: In Progress → Incomplete
assignee: Sreekumar S (sreesiv) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers