[2.0] MaaS 2.0 BMC information not removed when nodes are removed

Bug #1586555 reported by Bert JW Regeer on 2016-05-27
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Jeffrey C Jones
Milestone: 2.0.0

Bug Description

We ran into a couple of different failures that led to some interesting bugs:

1. We had provisioned the original PXE boot network as a /26 (10.189.69.0/26); with Juju and LXC containers for OpenStack, we needed to grow this network to a /25
2. Our BMC network was in the second /26 (10.189.69.64/26) next to the first that we wanted to grow
3. We changed the DHCP for the second /26 and moved it up one /26 (to 10.189.69.128/26) (DHCP outside of MaaS)
4. We removed all machines from MaaS by deleting them, and updated the subnet in MaaS to 10.189.69.0/25
5. One by one, we powered each machine up and commissioned it in MaaS

Two weeks later:

6. We mark a machine that won't deploy as broken; it had IP 10.189.69.58
7. We attempt to re-deploy all machines (18), and notice that as soon as we have deployed 9 machines, the 10th one won't deploy due to a failure to allocate an IP address. This was mistakenly attributed to assigning multiple auto-assigned networks in https://bugs.launchpad.net/maas/+bug/1586540
8. We assumed it had to do with trying to set more subnets on the various bond interfaces, but further inquiry found:

In the database in table maasserver_staticipaddress we noticed that 10.189.69.68 was already assigned, but was not assigned a subnet_id.
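To illustrate why that address matters, here is a small sketch using Python's standard ipaddress module with the ranges from the steps above. It is not part of the original debugging. The variable names are made up for illustration, and the last check assumes allocation proceeded sequentially from 10.189.69.59, which is only a guess.

```python
import ipaddress

# Ranges taken from the report; variable names are illustrative only.
old_pxe = ipaddress.ip_network("10.189.69.0/26")   # original PXE network
old_bmc = ipaddress.ip_network("10.189.69.64/26")  # original BMC network
new_pxe = ipaddress.ip_network("10.189.69.0/25")   # PXE network after growing it
stale_ip = ipaddress.ip_address("10.189.69.68")    # address found in the database

# Before the change the two /26 networks were disjoint.
overlaps_before = old_pxe.overlaps(old_bmc)

# After the change the /25 covers the whole old BMC /26.
overlaps_after = new_pxe.overlaps(old_bmc)
stale_in_new = stale_ip in new_pxe

# Assumption: if nine addresses were handed out sequentially starting at
# 10.189.69.59, the tenth candidate would be the stale address.
tenth_candidate = ipaddress.ip_address("10.189.69.59") + 9

print(overlaps_before, overlaps_after, stale_in_new, tenth_candidate)
```

Under that sequential-allocation assumption, the tenth deployment would land exactly on the leftover address, which would match the failure we saw.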

Attempting to delete it, there was a foreign key constraint on maasserver_bmc. It is at this point that I noticed that maasserver_bmc had a whole range of entries that were not referenced at all in maasserver_node.
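A read-only way to preview such orphaned rows before deleting anything could look like the sketch below. It uses Python's sqlite3 module with toy tables that carry only the columns mentioned in this report (id, ip_address_id, bmc_id). The ip column, the sample rows, and the simplified schema are assumptions for illustration; the real MAAS schema has more columns and runs on a different database.

```python
import sqlite3

# Toy in-memory schema: a simplified stand-in, not the real MAAS tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE maasserver_staticipaddress (id INTEGER PRIMARY KEY, ip TEXT);
CREATE TABLE maasserver_bmc (id INTEGER PRIMARY KEY, ip_address_id INTEGER);
CREATE TABLE maasserver_node (id INTEGER PRIMARY KEY, bmc_id INTEGER);

-- Hypothetical sample data: BMC 10 is left over from a deleted node.
INSERT INTO maasserver_staticipaddress VALUES (1, '10.189.69.68'), (2, '10.189.69.130');
INSERT INTO maasserver_bmc VALUES (10, 1), (11, 2);
INSERT INTO maasserver_node VALUES (100, 11), (101, NULL);
""")

# List BMC rows that no node references, along with the IP they still hold.
orphans = conn.execute("""
SELECT b.id, s.ip
FROM maasserver_bmc AS b
LEFT JOIN maasserver_node AS n ON n.bmc_id = b.id
LEFT JOIN maasserver_staticipaddress AS s ON s.id = b.ip_address_id
WHERE n.id IS NULL
ORDER BY b.id
""").fetchall()

print(orphans)
conn.close()
```

The LEFT JOIN form is just one alternative to NOT IN sub-queries; it only selects rows, so it can be used to check what a cleanup would touch.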

After running the following:

DELETE FROM maasserver_staticipaddress WHERE id IN (SELECT ip_address_id FROM maasserver_bmc WHERE maasserver_bmc.id NOT IN (SELECT bmc_id FROM maasserver_node WHERE bmc_id IS NOT NULL));

DELETE FROM maasserver_bmc WHERE id NOT IN (SELECT bmc_id FROM maasserver_node WHERE bmc_id IS NOT NULL);

DELETE FROM maasserver_bmcroutablerackcontrollerrelationship WHERE bmc_id NOT IN (SELECT id FROM maasserver_bmc);

(Yes, sub-queries, I'm sure there is a better way to do it... :-P)

We were able to deploy the rest of the nodes. We then released all nodes not marked broken and noticed that a single IP address (10.189.69.58) was still in use. Once we marked that node as fixed, deployed it, and then immediately released it, that IP was released. MaaS then started numbering machines from the beginning of the range (10.189.69.7 was the next available address) rather than from 10.189.69.59 (the next available after the broken one).

It seems that BMC information is not properly removed when a node is deleted from MaaS. This becomes an issue when you change network ranges and the old and new ranges can overlap.

Dave Chiluk (chiluk) on 2016-06-01
tags: added: sts
Changed in maas:
importance: Undecided → Critical
milestone: none → 2.0.0
summary: - MaaS 2.0 BMC information not removed when nodes are removed
+ [2.0b5] MaaS 2.0 BMC information not removed when nodes are removed
summary: - [2.0b5] MaaS 2.0 BMC information not removed when nodes are removed
+ [2.0] MaaS 2.0 BMC information not removed when nodes are removed
Changed in maas:
assignee: nobody → Jeffrey C Jones (trapnine)
status: New → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released