[2.x] 2 out of 3 rack controller interfaces are missing links

Bug #1722646 reported by Greg Lutostanski
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Blake Rouse

Bug Description

To set up networks in a repeatable manner, we use a script similar to https://github.com/lutostag/maaster.

On first setup, the dump of the controllers' interfaces is here:
https://pastebin.canonical.com/200275/

We can see that only one of the controllers (frxsah) had any IPs set up on any of its interfaces. This is not what should happen, as all controllers are connected to that same subnet and have static IPs set on it.

Logs of all infra nodes available at https://bugs.launchpad.net/maas/+bug/1722578/+attachment/4966827/+files/logs-2017-10-10-15.58.12.tar


tags: added: cdo-qa-blocker foundations-engine
removed: foundation-engine
Changed in maas:
milestone: none → 2.3.0beta3
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Greg,

I don't have enough context on what you want MAAS to do or on what your script is actually doing. What I understand is that the rack controllers don't have interfaces configured, but that's about it. Do you want to configure rack controller interfaces via MAAS, or something else?

As such, can you please clarify:

1. What do you want to do / what are you doing?
2. How do you do it? (I know the script, but what does the script really do?)
3. What is the expected result?

I don't know if this is related, but

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

some more context:
a little more detail on the bug: http://paste.ubuntu.com/25715844/

The context is that we're configuring MAAS with our subnets and VLANs (MAAS can't discover them all because the rack controller doesn't have interfaces on them) and turning on DHCP in the right places. To turn on DHCP in the right places, we read the list of rack controllers to find those that have links on our OAM network (10.245.208.0/20). All three rack controllers have interfaces up on that network, but MAAS is only creating links for 1 of the 3 rack controllers.
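
For readers following along, the selection step described here can be sketched roughly as below. This is an illustration only, not the reporters' actual script: the function name, the sample hostnames and the sample IP are hypothetical, and the JSON shape (an `interface_set` list whose entries carry a `links` list with `ip_address` fields) is an assumption about what the rack controller listing returns.

```python
import ipaddress


def controllers_with_link_on(controllers, cidr):
    """Return hostnames of controllers that have at least one interface
    link whose IP address falls inside the given CIDR."""
    network = ipaddress.ip_network(cidr)
    matches = []
    for controller in controllers:
        # Assumed shape: each controller has an "interface_set" list.
        for interface in controller.get("interface_set", []):
            ips = [
                link["ip_address"]
                for link in interface.get("links", [])
                if link.get("ip_address")
            ]
            if any(ipaddress.ip_address(ip) in network for ip in ips):
                matches.append(controller["hostname"])
                break  # one matching link is enough for this controller
    return matches


# Hypothetical sample data: one controller with a link, one without.
sample = [
    {
        "hostname": "rack-a",
        "interface_set": [
            {"name": "eth0",
             "links": [{"mode": "static", "ip_address": "10.245.208.10"}]},
        ],
    },
    {
        "hostname": "rack-b",
        "interface_set": [{"name": "eth0", "links": []}],
    },
]

dhcp_candidates = controllers_with_link_on(sample, "10.245.208.0/20")
print(dhcp_candidates)
```

With the symptom in this bug, a controller like the hypothetical `rack-b` is skipped even though its interface is physically up on the network, because MAAS never created the link.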

Changed in maas:
status: Incomplete → New
summary: - interfaces do not have ips
+ 2 out of 3 rack controller interfaces are missing links
Revision history for this message
Mike Pontillo (mpontillo) wrote : Re: 2 out of 3 rack controller interfaces are missing links

Can you provide the output of "sudo maas-rack support-dump --networking" for each controller? Thanks in advance.

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Reproduced again last night, except none of the rack controllers have subnets linked in this case.

Here is the support-dump requested. http://paste.ubuntu.com/25719855/

Here is a dump of reading the rack controller interfaces via the API:
http://paste.ubuntu.com/25719860/

Changed in maas:
status: Incomplete → New
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Mike Pontillo (mpontillo)
Revision history for this message
Andres Rodriguez (andreserl) wrote :
summary: - 2 out of 3 rack controller interfaces are missing links
+ [2.x] 2 out of 3 rack controller interfaces are missing links
Changed in maas:
milestone: 2.3.0beta3 → 2.3.0beta4
Changed in maas:
milestone: 2.3.0beta4 → 2.3.0rc1
Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Chris Gregan (cgregan) wrote :

@Andres
Is there a reason for this being set to incomplete?

Revision history for this message
Blake Rouse (blake-rouse) wrote :

@Chris,

I don't know why it's set to Incomplete. Are you able to reproduce this with the latest 2.3 beta3? I'm hoping it was fixed by some other HA work I did.

Changed in maas:
milestone: 2.3.0rc1 → 2.3.0rc2
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Yes, I've reproduced it with 2.3.0~beta3 (6368-g03ca7f4-0ubuntu1~16.04.1).

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

logs from the latest repro

Revision history for this message
Andres Rodriguez (andreserl) wrote :

It happened to me on a single region/rack. I see this error in the logs, which Mike believes is related.

2017-11-03 22:47:28 maasserver.listener: [info] Listening for database notifications.
2017-11-03 22:47:28 twisted.internet.defer: [critical] Unhandled error in Deferred:
2017-11-03 22:47:28 twisted.internet.defer: [critical]

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 434, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
    _inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/maasserver/regiondservices/reverse_dns.py", line 42, in startService
    RegionController.objects.get_running_controller)
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 875, in callInContext
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 232, in wrapper
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/models/node.py", line 660, in get_running_controller
    return self.get(system_id=get_maas_id())
  File "/usr/lib/python3/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/db/models/query.py", line 334, in get
    self.model._meta.object_name
maasserver.models.node.DoesNotExist: RegionController matching query does not exist.
2017-11-03 22:47:28 maasserver.regiondservices.active_discovery: [info] Active network discovery: Discovery interval set to 10800 seconds.
2017-11-03 22:47:29 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2017-11-03 22:47:29 maasserver.listener: [error] Unable to connect to database: dictionary changed size during iteration
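
The traceback shows a lookup of the running region controller raising `DoesNotExist` during service startup. As a rough sketch of one defensive pattern for that kind of startup race, and explicitly not the fix that was applied in MAAS, a lookup can be retried a bounded number of times before giving up. All names below (`ControllerNotRegistered`, `wait_for_controller`, `fake_lookup`) are hypothetical stand-ins, and the sketch uses plain Python rather than Twisted.

```python
import time


class ControllerNotRegistered(LookupError):
    """Hypothetical stand-in for the DoesNotExist error in the traceback."""


def wait_for_controller(lookup, attempts=5, delay=0.0, sleep=time.sleep):
    """Call lookup() until it succeeds or the attempts are exhausted.

    Re-raises the last ControllerNotRegistered if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return lookup()
        except ControllerNotRegistered:
            if attempt == attempts:
                raise
            sleep(delay)


# Simulated lookup that fails twice before the record "appears".
calls = {"count": 0}


def fake_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ControllerNotRegistered(
            "RegionController matching query does not exist.")
    return "region-controller"


result = wait_for_controller(fake_lookup)
print(result, calls["count"])
```

Whether a retry is appropriate here depends on why the record is missing at startup, which this log excerpt alone does not establish.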

Revision history for this message
Andres Rodriguez (andreserl) wrote :

With the above comment, something else I noticed is that:

1. Enabled DHCP
2. DHCPd wasn't running. Service status shows it as off. Subnets listing shows it as on. See screenshots.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

To address the above, I did the following:

1. stopped maas-rackd
2. restarted maas-regiond (interfaces are now populated).
3. restarted maas-regiond (dhcpd is now enabled).

Revision history for this message
Mike Pontillo (mpontillo) wrote :

This backtrace is present in all the regiond.log files:

http://paste.ubuntu.com/25881891/

This indicates to me that perhaps this exception (in the reverse DNS service) is taking down the entire network monitoring service, causing the interfaces to fail to update until the second time the region is started.

summary: - [2.x] 2 out of 3 rack controller interfaces are missing links
+ [2.x] "Unhandled error in Deferred" via the reverse DNS service can
+ crash network monitoring
Changed in maas:
status: New → Triaged
Revision history for this message
Blake Rouse (blake-rouse) wrote : Re: [2.x] "Unhandled error in Deferred" via the reverse DNS service can crash network monitoring

@mpontillo

Looking at the code, the ReverseDNSService and the NetworkMonitoringService do not directly connect to each other in the reactor. So the ReverseDNSService is crashing (which is not good!), but that does not seem to have any correlation with taking down the NetworkMonitoringService.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

@blake_r, I came to the same conclusion as well after looking deeper at this. I've filed a new bug, #1730474, whose fix I hope will resolve this class of issues.

summary: - [2.x] "Unhandled error in Deferred" via the reverse DNS service can
- crash network monitoring
+ [2.x] 2 out of 3 rack controller interfaces are missing links
Revision history for this message
Andres Rodriguez (andreserl) wrote :

The fix for #1730474 doesn't seem to have fixed this issue for me:

1. Installed region/rack
2. fabrics/subnets were discovered, rack controller was updated but the IP addresses
3. controller listing showed 'unknown' version
4. restarted maas-regiond and things fixed themselves

regiond.log before restart: http://paste.ubuntu.com/25906216/
rackd.log before restart: http://paste.ubuntu.com/25906217/
maas.log before restart: http://paste.ubuntu.com/25906220/

Revision history for this message
Andres Rodriguez (andreserl) wrote :

postgresql log after fresh install:

http://paste.ubuntu.com/25906312/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this on 2.3.0~rc2 (6395-g1a078b4-0ubuntu1~16.04.1), logs attached.

Changed in maas:
assignee: Mike Pontillo (mpontillo) → Blake Rouse (blake-rouse)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released