Duplicate interfaces causing rack controller not to register

Bug #1811222 reported by Björn Tillenius
This bug affects 2 people
Affects | Status       | Importance | Assigned to | Milestone
MAAS    | Fix Released | Critical   | Blake Rouse |

Bug Description

I did a fresh install of 2.5.1-7489-g2f25a2cc0-0ubuntu1~18.04.1 using the maas package.

After the install finished, there's a notice that one rack controller isn't connected to the region. Looking at the controller page, it says that rackd is 25% connected.

I see this in the regiond.log:

  maasserver.models.interface.MultipleObjectsReturned: get() returned more than one PhysicalInterface -- it returned 2!

I see that somehow I do have two physical interfaces with the same MAC address and name. Not sure how they got there, though:

maasdb=# select id, mac_address, name, node_id, type, vlan_id from maasserver_interface;
 id | mac_address | name | node_id | type | vlan_id
----+-------------------+-----------------+---------+----------+---------
  1 | 00:16:3e:7c:bf:ee | eth0 | 1 | physical | 5001
  2 | 00:16:3e:a0:b9:39 | eth1 | 1 | physical | 5002
  4 | 52:54:00:b0:5e:8d | virbr0 | 1 | bridge | 5003
  5 | 6e:65:9f:c2:31:e2 | br0 | 1 | bridge | 5001
  6 | 66:4b:16:ec:b3:d0 | br1 | 1 | bridge | 5002
  3 | 52:54:00:fe:18:d6 | mpqemubr0-dummy | 1 | physical | 5004
  7 | 52:54:00:fe:18:d6 | mpqemubr0 | 1 | bridge | 5004
  8 | 00:16:3e:7c:bf:ee | eth0 | 1 | physical |
  9 | 00:16:3e:a0:b9:39 | eth1 | 1 | physical |
 10 | 52:54:00:fe:18:d6 | mpqemubr0-dummy | 1 | physical |
(10 rows)
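
The duplicates in the listing above can be found mechanically by grouping on node, MAC address, and name. The following is a minimal sketch, assuming a simplified table that has only the columns shown in the query; it loads the rows from this report into an in-memory SQLite database rather than a real MAAS PostgreSQL database, and `find_duplicate_physical` is a hypothetical helper name.

```python
import sqlite3

# Rows copied from the psql output in this report.
ROWS = [
    (1, "00:16:3e:7c:bf:ee", "eth0", 1, "physical", 5001),
    (2, "00:16:3e:a0:b9:39", "eth1", 1, "physical", 5002),
    (4, "52:54:00:b0:5e:8d", "virbr0", 1, "bridge", 5003),
    (5, "6e:65:9f:c2:31:e2", "br0", 1, "bridge", 5001),
    (6, "66:4b:16:ec:b3:d0", "br1", 1, "bridge", 5002),
    (3, "52:54:00:fe:18:d6", "mpqemubr0-dummy", 1, "physical", 5004),
    (7, "52:54:00:fe:18:d6", "mpqemubr0", 1, "bridge", 5004),
    (8, "00:16:3e:7c:bf:ee", "eth0", 1, "physical", None),
    (9, "00:16:3e:a0:b9:39", "eth1", 1, "physical", None),
    (10, "52:54:00:fe:18:d6", "mpqemubr0-dummy", 1, "physical", None),
]


def load_report_rows() -> sqlite3.Connection:
    """Build an in-memory table with only the columns shown in the report."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE maasserver_interface ("
        " id INTEGER PRIMARY KEY, mac_address TEXT, name TEXT,"
        " node_id INTEGER, type TEXT, vlan_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO maasserver_interface VALUES (?, ?, ?, ?, ?, ?)", ROWS
    )
    return conn


def find_duplicate_physical(conn: sqlite3.Connection) -> list:
    """Return (node_id, mac_address, name, copies) for duplicated physical rows."""
    return conn.execute(
        "SELECT node_id, mac_address, name, COUNT(*) AS copies"
        " FROM maasserver_interface"
        " WHERE type = 'physical'"
        " GROUP BY node_id, mac_address, name"
        " HAVING COUNT(*) > 1"
        " ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    for row in find_duplicate_physical(load_report_rows()):
        print(row)
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query should also run unchanged in psql against the real table, since it only uses columns that appear in the output above.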

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
status: Triaged → Confirmed
milestone: none → 2.5.1
no longer affects: maas/2.6
Changed in maas:
milestone: 2.5.1 → 2.6.0
Michał Ajduk (majduk) wrote :

I've hit the same issue but in MAAS 2.6.0 (7802-g59416a869-0ubuntu1~18.04.1) configured without MAAS HA. The uploaded logs are the same in my case.

Michał Ajduk (majduk) wrote :

MAAS interface list in the database, note duplicates:

maasdb=# select id, mac_address, name, node_id, type, vlan_id from maasserver_interface;
 id | mac_address | name | node_id | type | vlan_id
----+-------------------+------------+---------+----------+---------
  1 | 4c:77:6d:99:30:da | eno1 | 1 | physical | 5001
  2 | 4c:77:6d:99:30:db | eno2 | 1 | physical | 5002
  3 | 3c:fd:fe:d5:87:10 | enp136s0f0 | 1 | physical | 5003
  4 | 3c:fd:fe:d5:87:11 | enp136s0f1 | 1 | physical | 5004
  5 | 3c:fd:fe:d5:87:12 | enp136s0f2 | 1 | physical | 5005
  6 | 3c:fd:fe:d5:87:13 | enp136s0f3 | 1 | physical | 5006
  8 | 3c:fd:fe:d5:71:49 | enp175s0f1 | 1 | physical | 5008
  9 | 3c:fd:fe:d5:71:4a | enp175s0f2 | 1 | physical | 5009
 10 | 3c:fd:fe:d5:71:4b | enp175s0f3 | 1 | physical | 5010
 11 | 3c:fd:fe:bb:19:f8 | enp25s0f0 | 1 | physical | 5011
 12 | 3c:fd:fe:bb:19:f9 | enp25s0f1 | 1 | physical | 5012
 13 | 3c:fd:fe:bb:15:3c | enp94s0f0 | 1 | physical | 5013
 14 | 3c:fd:fe:bb:15:3d | enp94s0f1 | 1 | physical | 5014
  7 | 3c:fd:fe:d5:71:48 | enp175s0f0 | 1 | physical | 5003
 15 | fa:6a:da:46:4b:8c | bond1 | 1 | bond | 5003
 16 | 16:03:d1:99:8c:b5 | broam | 1 | bridge | 5003
 17 | 4c:77:6d:99:30:da | eno1 | 1 | physical |
 18 | 4c:77:6d:99:30:db | eno2 | 1 | physical |
 19 | 3c:fd:fe:d5:87:10 | enp136s0f0 | 1 | physical |
 20 | 3c:fd:fe:d5:87:11 | enp136s0f1 | 1 | physical |
 21 | 3c:fd:fe:d5:87:12 | enp136s0f2 | 1 | physical |
 22 | 3c:fd:fe:d5:87:13 | enp136s0f3 | 1 | physical |
 23 | 3c:fd:fe:d5:71:48 | enp175s0f0 | 1 | physical |
 24 | 3c:fd:fe:d5:71:49 | enp175s0f1 | 1 | physical |
 25 | 3c:fd:fe:d5:71:4a | enp175s0f2 | 1 | physical |
 26 | 3c:fd:fe:d5:71:4b | enp175s0f3 | 1 | physical |
 27 | 3c:fd:fe:bb:19:f8 | enp25s0f0 | 1 | physical |
 28 | 3c:fd:fe:bb:19:f9 | enp25s0f1 | 1 | physical |
 29 | 3c:fd:fe:bb:15:3c | enp94s0f0 | 1 | physical |
 30 | 3c:fd:fe:bb:15:3d | enp94s0f1 | 1 | physical |

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
milestone: 2.6.0 → 2.7.0alpha1
status: Confirmed → In Progress
no longer affects: maas/2.5
no longer affects: maas/2.6
Changed in maas:
status: In Progress → Fix Committed
Björn Tillenius (bjornt) wrote :

Blake, could you expand a bit on the fix you made? I see that you added an index, but what code paths were creating interfaces in parallel?

I'm curious, since one of the entries has a VLAN id and one doesn't. So what code paths were creating that, and how do we ensure that the right entry gets added?

Blake Rouse (blake-rouse) wrote :

The code path is that two controller processes, each of which could be a regiond process or a rackd process, can start a registration at the same time. The check that a MAC address belongs to only one physical interface was done inside the code and not at the database level.

So when two threads ran and both created the same physical interface with the same MAC address at the same time, the database did not enforce data consistency as it should have, because it didn't know about the constraint. If you look at the MP, you can see that this case is specifically tested, and if you remove the unique index, that test fails.

Now, when two threads run at the same time, one of them gets an IntegrityError and the whole process is retried. On the retry, the registration process notices that a physical interface already exists and does the correct thing. The code path already handles the case where a physical interface already exists for the same controller (which is the case here).
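
To illustrate the explanation above, here is a minimal sketch of a check-then-insert race, with and without a database-level unique index. It is not the MAAS code: it uses an in-memory SQLite table instead of PostgreSQL, simulates the two workers sequentially on one connection, and the table layout, the index definition (`physical_mac_uniq`), and all function names are assumptions made for the example.

```python
import sqlite3


def make_db(with_unique_index: bool) -> sqlite3.Connection:
    """Create a simplified interface table, optionally with a unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE interface ("
        " id INTEGER PRIMARY KEY, node_id INTEGER,"
        " mac_address TEXT, name TEXT, type TEXT)"
    )
    if with_unique_index:
        # Hypothetical index; the real MAAS index definition may differ.
        conn.execute(
            "CREATE UNIQUE INDEX physical_mac_uniq"
            " ON interface (mac_address) WHERE type = 'physical'"
        )
    return conn


def physical_exists(conn: sqlite3.Connection, mac: str) -> bool:
    """Application-level check: is there already a physical row for this MAC?"""
    row = conn.execute(
        "SELECT 1 FROM interface WHERE type = 'physical' AND mac_address = ?",
        (mac,),
    ).fetchone()
    return row is not None


def create_physical(conn: sqlite3.Connection, node_id: int, mac: str, name: str) -> None:
    conn.execute(
        "INSERT INTO interface (node_id, mac_address, name, type)"
        " VALUES (?, ?, ?, 'physical')",
        (node_id, mac, name),
    )


def count_physical(conn: sqlite3.Connection, mac: str) -> int:
    return conn.execute(
        "SELECT COUNT(*) FROM interface WHERE type = 'physical' AND mac_address = ?",
        (mac,),
    ).fetchone()[0]


def simulate_race(conn: sqlite3.Connection, node_id: int, mac: str, name: str) -> list:
    """Two workers both run the existence check before either one inserts."""
    checks = [physical_exists(conn, mac), physical_exists(conn, mac)]
    outcomes = []
    for already_there in checks:
        if already_there:
            outcomes.append("reused")
            continue
        try:
            create_physical(conn, node_id, mac, name)
            outcomes.append("created")
        except sqlite3.IntegrityError:
            # Retry: the repeated check now sees the row the other worker created.
            found = physical_exists(conn, mac)
            outcomes.append("reused on retry" if found else "failed")
    return outcomes


if __name__ == "__main__":
    for indexed in (False, True):
        db = make_db(indexed)
        result = simulate_race(db, 1, "00:16:3e:7c:bf:ee", "eth0")
        print(indexed, result, count_physical(db, "00:16:3e:7c:bf:ee"))
```

Without the index, both workers pass the check and both inserts succeed, which leaves two rows for the same MAC address. With the index, the second insert raises `IntegrityError`, and the retried check finds the existing row.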

Björn Tillenius (bjornt) wrote :

Sure, I get why we won't error out anymore.

But why was one physical interface created with a vlan id, and the other without? Will the interface that we create now have a vlan id, or not?

Blake Rouse (blake-rouse) wrote :

Storage of the interface information happens in 2 stages for registration.

In stage 1, plain physical interfaces are created with MAC addresses but no links to any VLANs or IP addresses.

Then, in stage 2, after network discovery has run on the controller for a moment, the results are written to the database. That is where the VLAN linkage, IP addresses, and everything else are set up.

So I believe what you are seeing is a case where both threads completed stage 1, but only one of the threads did stage 2. That is correct, as only one thread should do stage 2, but the stage 1 threads had already created the collision, causing the stage 2 save to fail.
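
The two stages described above can be sketched in plain Python to show why a duplicated stage 1 breaks a later single-row lookup. This is only an illustration under assumptions: the `Interface` class, the list used as a table, and the `stage1_create`, `get`, and `stage2_link` helpers are hypothetical stand-ins, and `get` merely mimics the "exactly one row" behaviour behind the `MultipleObjectsReturned` error quoted from regiond.log.

```python
from dataclasses import dataclass
from typing import List, Optional


class MultipleObjectsReturned(Exception):
    """Stand-in for the exception seen in regiond.log."""


@dataclass
class Interface:
    id: int
    mac_address: str
    name: str
    vlan_id: Optional[int] = None  # stage 1 leaves this unset


def stage1_create(table: List[Interface], mac: str, name: str) -> Interface:
    """Stage 1: create a bare physical interface with no VLAN or IP links."""
    iface = Interface(id=len(table) + 1, mac_address=mac, name=name)
    table.append(iface)
    return iface


def get(table: List[Interface], mac: str) -> Interface:
    """Return the single interface with this MAC, or raise if not exactly one."""
    matches = [iface for iface in table if iface.mac_address == mac]
    if len(matches) > 1:
        raise MultipleObjectsReturned(
            f"get() returned more than one Interface -- it returned {len(matches)}!"
        )
    if not matches:
        raise LookupError(mac)
    return matches[0]


def stage2_link(table: List[Interface], mac: str, vlan_id: int) -> Interface:
    """Stage 2: attach the discovered VLAN to the interface created in stage 1."""
    iface = get(table, mac)
    iface.vlan_id = vlan_id
    return iface


if __name__ == "__main__":
    table: List[Interface] = []
    stage1_create(table, "00:16:3e:7c:bf:ee", "eth0")  # first worker
    stage1_create(table, "00:16:3e:7c:bf:ee", "eth0")  # second worker, same MAC
    try:
        stage2_link(table, "00:16:3e:7c:bf:ee", 5001)
    except MultipleObjectsReturned as exc:
        print(exc)
```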

Changed in maas:
status: Fix Committed → Fix Released