[2.2 beta3] multiple machines allocated and do not transition to Deploying

Bug #1671651 reported by Larry Michel
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  Undecided   Unassigned

Bug Description

Multiple servers are allocated but do not transition to deploying state.

I see a number of servers that look as if they are being edited, which typically happens when there is an issue with the region controller. However, looking at its status, I don't see it being disconnected (the disconnect could be intermittent).

drapion.oilstaging     Allocated  oil  40  64.0  1   299.0
hayward-04.oilstaging  Allocated  oil   8  16.0  2   429.5
hayward-05.oilstaging  Allocated  oil   8  16.0  2   429.5
hayward-13.oilstaging  Allocated  oil   8  15.7  2   429.5
hayward-17.oilstaging  Allocated  oil   8  16.0  2   429.5
hayward-19.oilstaging  Allocated  oil   8  16.0  2   429.5
hayward-20.oilstaging  Allocated  oil   8  16.0  2   429.5
hayward-40.oilstaging  Allocated  oil   8  16.0  2   429.5
tucker.oilstaging      Allocated  oil  12  32.0  6  6001.2
velie.oilstaging       Allocated  oil  12  32.0  6  6001.2

ubuntu@maas2-integration-daily:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-====================================-============-=================================================
ii maas 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.2.0~beta3+bzr5795-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Larry Michel (lmic) wrote :
tags: added: cdo-qa-blocker
Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Hi Larry,

It is really hard to debug when you simply grab all the logs and don't provide those from the exact timeframe in which you are seeing the issue. As such, please attach:

1. Logs from the timeframe in which you are seeing the issue.
2. Debug logs from juju from when it is making these requests.
3. Your juju configuration for the MAAS provider.

Additionally, please do the following:

Edit /usr/lib/python3/dist-packages/maasserver/djangosettings/settings.py and change the variable DEBUG = False to DEBUG = True, then restart maas-regiond and see whether any errors appear when the issue is reproduced.

Capture the logs while the issue is happening.

That said, it is juju that:
1. allocates a machine (we can see that the request came in just fine).
2. once the machine is allocated, requests to deploy it (this request doesn't seem to be seen).
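
To make the two steps above concrete, here is a minimal sketch of the two requests involved, assuming the MAAS 2.0 REST API paths (`machines/?op=allocate` and `machines/<system_id>/?op=deploy`). The helper names and the example system ID are hypothetical, and the sketch only builds URLs; it does not sign or send anything.

```python
# Sketch only: builds the URLs a client would POST to for the two steps.
# Assumption: MAAS 2.x REST API layout under <endpoint>/api/2.0/.
API_PREFIX = "api/2.0/"


def _base(endpoint: str) -> str:
    """Normalise the endpoint so it always ends with a single slash."""
    return endpoint.rstrip("/") + "/"


def allocate_url(endpoint: str) -> str:
    """Step 1: allocate (acquire) a machine for the requesting user."""
    return _base(endpoint) + API_PREFIX + "machines/?op=allocate"


def deploy_url(endpoint: str, system_id: str) -> str:
    """Step 2: deploy the machine that step 1 returned."""
    return _base(endpoint) + API_PREFIX + "machines/" + system_id + "/?op=deploy"


if __name__ == "__main__":
    endpoint = "http://10.244.192.10:5240/MAAS/"
    print(allocate_url(endpoint))
    print(deploy_url(endpoint, "abc123"))  # "abc123" is a made-up system ID
```

If the first request shows up in the region logs for a machine but the second never does, the machine would be expected to stay in Allocated, which matches the symptom reported here.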

Andres Rodriguez (andreserl) wrote :

Also, please attach the full event log for each of these machines during the time of the issue.

Larry Michel (lmic) wrote :

@Andres,
The time frame was basically the same as when I provided the logs. I observed this issue, opened the bug, and collected the logs around the same time frame. I am basically recreating this on every single deployment. hayward-04, hayward-20, hayward-40 and hayward-17 were servers I was deploying over and over, and they were consistently going to Allocated and not transitioning. I have attached the last 1000 entries for all the event logs.

clouds:
  maas:
    type: maas
    auth-types: [oauth1]
    endpoint: http://10.244.192.10:5240/MAAS/

credentials:
  maas:
    oil-ci:
      auth-type: oauth1
      maas-oauth: *******************************************************

Larry Michel (lmic) wrote :

I have applied the debug settings and will monitor for recreate:

root@maas2-integration-daily:~# diff /usr/lib/python3/dist-packages/maasserver/djangosettings/settings.py /usr/lib/python3/dist-packages/maasserver/djangosettings/settings.py.ORIG
52c52
< DEBUG = True
---
> DEBUG = False
root@maas2-integration-daily:~# sudo service maas-regiond restart
sudo: unable to resolve host maas2-integration-daily
root@maas2-integration-daily:~# sudo service maas-regiond status
sudo: unable to resolve host maas2-integration-daily
● maas-regiond.service - MAAS Region Controller
   Loaded: loaded (/lib/systemd/system/maas-regiond.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2017-03-10 17:06:11 UTC; 20s ago
     Docs: https://maas.io/
  Process: 37256 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 37256 (code=exited, status=0/SUCCESS)
    Tasks: 0
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/maas-regiond.service

Mar 10 17:06:11 maas2-integration-daily systemd[1]: Starting MAAS Region Controller...
Mar 10 17:06:11 maas2-integration-daily systemd[1]: Started MAAS Region Controller.
root@maas2-integration-daily:~# date
Fri Mar 10 17:06:38 UTC 2017

Andres Rodriguez (andreserl) wrote :

@Larry,

The logs you provided cover a period of 5 days. It would be much better if you share the logs over the period that you saw a certain set of specific issues (i.e. around specific hours or specific days, not a whole week).

That said, if you see this issue again, can you please attach the logs with DEBUG enabled?
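
One way to narrow a multi-day log down to the window of interest is sketched below. It assumes syslog-style lines that start with a `Mar 10 17:06:11` timestamp (as in the systemd output earlier in this report); other MAAS logs may use a different timestamp format, and the function names here are hypothetical. Since syslog timestamps carry no year, the year is passed in explicitly.

```python
from datetime import datetime


def in_window(line, start, end, year):
    """Return True if a syslog-style line falls inside [start, end]."""
    try:
        # The first 15 characters hold the timestamp, e.g. "Mar 10 17:06:11".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        # Lines without a leading timestamp (e.g. tracebacks) are skipped.
        return False
    return start <= stamp <= end


def filter_log(lines, start, end, year=2017):
    """Keep only the lines whose timestamp lies in the given window."""
    return [line for line in lines if in_window(line, start, end, year)]


if __name__ == "__main__":
    sample = [
        "Mar 10 17:06:11 maas2-integration-daily systemd[1]: Starting MAAS Region Controller...",
        "Mar 10 18:30:00 maas2-integration-daily systemd[1]: (made-up later entry)",
    ]
    window = (datetime(2017, 3, 10, 17, 0), datetime(2017, 3, 10, 17, 30))
    for line in filter_log(sample, *window):
        print(line)
```

Note that this drops continuation lines such as tracebacks, so for debugging it may be better to widen the window slightly and trim by hand.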

Jason Hobbs (jason-hobbs) wrote :

We haven't seen this one again afaik.

Chris Gregan (cgregan)
Changed in maas:
status: Incomplete → Invalid