[2.0] TFTP issues when rackd — Missing connections to 1 region controller(s).

Bug #1626654 reported by Larry Michel on 2016-09-22
28
This bug affects 9 people
Affects Status Importance Assigned to Milestone
MAAS
Critical
Blake Rouse
Trunk
Critical
Blake Rouse

Bug Description

We started seeing our EFI systems failing to deploy. It looks to be an issue with the TFTP exchange with the maas server. the errors started happening around 1800 UTC on 9/21 and appears to impact EFI systems only.

Per juju status, this was the error:

machines:
  "0":
    juju-status:
      current: pending
      since: 22 Sep 2016 13:03:55Z
    dns-name: 10.245.0.153
    instance-id: 4y3xat
    machine-status:
      message: 'Failed deployment: bootx64.efi'
      since: 22 Sep 2016 13:45:27Z
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:11Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:11Z
        series: trusty
      0/lxd/1:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:13Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:13Z
        series: trusty
      0/lxd/2:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:15Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:15Z
        series: trusty

In the rackd.log, only seeing:
2016-09-22 16:36:44 [TFTP (UDP)] Datagram received from ('10.245.0.226', 1705): <RRQDatagram(filename=b'bootx64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'1468')]))>

and monitoring the console of the booting server:
..PXE-E23: Client received TFTP error from server.

On the rackd page, it shows:

Region Controller

   regiond
   bind9
   proxy
Rack Controller

   rackd — Missing connections to 1 region controller(s).
   http
   tftp
   dhcpd
   dhcpd6
   tgt

ubuntu@maas2-production:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==============================-============-=================================================
ii maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Related branches

Larry Michel (lmic) wrote :
summary: - [2.0] PXE EFI boot failures: message: 'Failed deployment: bootx64.efi' -
- PXE-E23 Client received TFTP error from server
+ [2.0] 'Failed deployment: bootx64.efi' - rackd — Missing connections to
+ 1 region controller(s).
summary: - [2.0] 'Failed deployment: bootx64.efi' - rackd — Missing connections to
- 1 region controller(s).
+ [2.0] TFTP issues when rackd — Missing connections to 1 region
+ controller(s).
Changed in maas:
importance: Undecided → Critical
status: New → Confirmed
Gavin Panella (allenap) on 2016-09-26
Changed in maas:
status: Confirmed → Triaged
Larry Michel (lmic) wrote :

This actually turns out to be re-creatable. It happens when doing an image upload.

I recreated it with:

maas root boot-resources create name=windows/win2012hvr2 title="Windows \"Hyper-V 2012 R2\"" architecture=amd64/generic filetype=ddtgz content@=/home/ubuntu/images/win2012hvr2-dd-img-root-dd --debug

In retrospect, original issue happened around the time that I was loading the previous 1.6Gig image. This image is larger 2.9Gig and I saw unresponsiveness immediately after starting the upload and seeing the multiple disconnections and attempts to reconnect. Attaching the latest logs..

For 2.1, I uploaded the same issue and did not observe this.

Larry Michel (lmic) wrote :

Andres, I recreated with 2.1 this morning. Yesterday, it went without a hitch but this morning, the things became sluggish as the upload image was completing or shortly after; I couldn't open the rackd.log for a short while nor could I load any page on the webui. I did see a message "rack controller importing ... " flash on the screen before things from bad to worse. My guess is that it may have been due to part of the upload image overlapping a bit with the import of daily images based on the completion timestamp of the ladder. Once things settled down, I then saw the error in the attached screenshot. All of this would have happened around the ~1320UTC and ~1328UTC time frame. I'll attach the logs shortly.

Larry Michel (lmic) wrote :

logs before restarting maas-rackd attached.

Blake Rouse (blake-rouse) wrote :

The logs don't provide much help at the moment. It does look like the rack is trying to sync often from the region, so the rack believes that the region has information that it does not. Your maas.log is also unhelpful as it full of messages about an IP address moving between two MAC addresses. I recommend checking your network to make sure that same IP address is not assign to two interfaces.

mahmoh (mahmoh) wrote :

Also hitting this on other arches with a previously working server on MAAS 2.0 beta:

2016-10-10 10:05:44 [TFTP (UDP)] Datagram received from ('172.16.3.193', 1583): <RRQDatagram(filename=b'grubaa64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'1468')]))>

mahmoh (mahmoh) wrote :

^Spoke too soon, the image was not downloaded on the correct maas server, it worked after download: 2.0.0+bzr5189-0ubuntu1~16.04.1

Changed in maas:
milestone: 2.1.0 → 2.1.1
Larry Michel (lmic) wrote :

Recreating with 2.1. I will attach the logs shortly.

Blake Rouse (blake-rouse) wrote :

2016-10-27 19:03:53 [-] WSGI application error
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 513, in run
     self.reactor.callFromThread(wsgiError, self.started, *exc_info())
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 496, in run
     appIterator = self.application(self.environ, self.startResponse)
   File "/usr/lib/python3/dist-packages/django/core/handlers/wsgi.py", line 189, in __call__
     response = self.get_response(request)
   File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 240, in get_response
     request = reset_request(request)
   File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 90, in reset_request
     wsgi_input._wrapped.seek(0)
 builtins.ValueError: I/O operation on closed file.

2016-10-27 19:03:53 [-] Unhandled Error
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 364, in startReactor
     self.config, oldstdout, oldstderr, self.profiler, reactor)
   File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 285, in runReactorWithLogging
     reactor.run()
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1194, in run
     self.mainLoop()
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
     self.runUntilCurrent()
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
     f(*a, **kw)
   File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 512, in wsgiError
     self.request.finish()
   File "/usr/lib/python3/dist-packages/twisted/web/server.py", line 224, in finish
     return http.Request.finish(self)
   File "/usr/lib/python3/dist-packages/twisted/web/http.py", line 865, in finish
     "Request.finish called on a request after its connection was lost; "
 builtins.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.

Blake Rouse (blake-rouse) wrote :

>>> [process.pid for process in RegionControllerProcess.objects.all()]
[509, 512, 518, 522]

>>> [(conn.endpoint.address, conn.endpoint.port, conn.endpoint.process.pid) for conn in RegionRackRPCConnection.objects.all()]
[('192.168.224.241', 5250, 522), ('10.245.0.10', 5252, 512), ('10.245.0.10', 5253, 518)]

You can see that the rack controller is not connected to pid 509. Now if this causes TFTP issues that is unknown as at the moment OIL is still deploying fine. But it is possible that it causes issues because the rack controller should have tried to re-connect to pid 509. Since it didn't that means the rack is in a bad state and might think that the connection to pid 509 is good. If the rack controller tries to use that connection then it would have an issue. I believe a log would appear if it did and at the moment I am not seeing anything in the logs.

Blake Rouse (blake-rouse) wrote :

The region is showing all endpoints over the RPC endpoint:

{"eventloops": {"maas2-production:pid=512": [["10.245.0.10", 5252], ["192.168.224.241", 5252]], "maas2-production:pid=522": [["192.168.224.241", 5250], ["10.245.0.10", 5250]], "maas2-production:pid=518": [["10.245.0.10", 5253], ["192.168.224.241", 5253]], "maas2-production:pid=509": [["192.168.224.241", 5251], ["10.245.0.10", 5251]]}}

Changed in maas:
milestone: 2.1.1 → 2.1.2
Changed in maas:
milestone: 2.1.2 → 2.1.3
Ryan Beisner (1chb1n) on 2017-01-24
tags: added: uosci
no longer affects: maas/2.0
Ryan Beisner (1chb1n) wrote :

In our OpenStack Charms test lab, the controller periodically and repeatedly goes into error state for rackd in the web ui. Restarting maas-rackd resolves it for some time.

MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)

no longer affects: maas/2.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers