8 nodes transition to failed state within very short period of time

Bug #1454469 reported by Larry Michel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Expired
Undecided
Unassigned

Bug Description

We had 8 nodes transition to failed state within 2 minutes. I couldn't tell from the logs (attached) what could have caused that to happen. It hit that rash of failures then things went relatively quiet for the past day. This resulted in the servers being in pending state in juju_status.yaml with a few of these servers came from one OpenStack deployment build.

See below from juju_status.yaml:

'1':
    agent-state: pending
    containers:
      1/lxc/0:
        instance-id: pending
        series: trusty
      1/lxc/1:
        instance-id: pending
        series: trusty
    dns-name: hayward-43.oil
    hardware: arch=amd64 cpu-cores=8 mem=16384M tags=hw-ok,oil,hardware-sm15k,hw-glance-sm15k
    instance-id: /MAAS/api/1.0/nodes/node-9df8a42a-c4cd-11e3-824b-00163efc5068/
    series: trusty
  '2':
    agent-state: pending
    containers:
      2/lxc/0:
        instance-id: pending
        series: trusty
    dns-name: hayward-21.oil
    hardware: arch=amd64 cpu-cores=8 mem=16384M tags=hw-ok,oil-slave-4,hardware-sm15k,hw-glance-sm15k
    instance-id: /MAAS/api/1.0/nodes/node-a38685ec-c4cd-11e3-8102-00163efc5068/
    series: trusty
  '3':
    agent-state: pending
    containers:
      3/lxc/0:
        instance-id: pending
        series: trusty
    dns-name: apsaras.oil
    hardware: arch=amd64 cpu-cores=4 mem=32768M tags=hw-ok,oil-slave-1,hardware-dell-poweredge-R210,console-com2
    instance-id: /MAAS/api/1.0/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/
    series: trusty
...
  '5':
    agent-state: pending
    containers:
      5/lxc/0:
        instance-id: pending
        series: trusty
      5/lxc/1:
        instance-id: pending
        series: trusty
    dns-name: hayward-49.oil
    hardware: arch=amd64 cpu-cores=8 mem=32768M tags=hw-ok,oil-slave-1,hardware-sm15k,hw-glance-sm15k
    instance-id: /MAAS/api/1.0/nodes/node-9ff3a32e-c4cd-11e3-824b-00163efc5068/
    series: trusty

From maas logs:

Node failed to deploy, but we're looking at a number of 8 failed deployments during a ~2 minute window:
May 11 13:16:29 maas-trusty-back-may22 maas.node: [INFO] hayward-43: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:16:35 maas-trusty-back-may22 maas.node: [INFO] hayward-36: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:16:53 maas-trusty-back-may22 maas.node: [INFO] hayward-21: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:17:18 maas-trusty-back-may22 maas.node: [INFO] apsaras.local: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:18:08 maas-trusty-back-may22 maas.node: [INFO] reading.local: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:18:12 maas-trusty-back-may22 maas.node: [INFO] glover.local: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:18:19 maas-trusty-back-may22 maas.node: [INFO] hayward-55: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:18:19 maas-trusty-back-may22 maas.node: [INFO] hayward-49: Status transition from DEPLOYING to FAILED_DEPLOYMENT

I have attached some logs from maas server.

Tags: oil
Revision history for this message
Larry Michel (lmic) wrote :
description: updated
Revision history for this message
Blake Rouse (blake-rouse) wrote :

The attached regiond log only goes to May 08, and the attached maas.log only goes to Apr 25. Are you sure you attached the correct logs?

Going to mark this as incomplete until correct logs are attached.

Changed in maas:
status: New → Incomplete
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Take that back the maas.log does go to May 8th, just like the regiond.log. But it still does not go to May 11, so I can check what was occurring during that time.

Revision history for this message
Larry Michel (lmic) wrote :

That was the wrong tarball. Here's the right one.

Changed in maas:
status: Incomplete → New
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Looks like an issue with curtin installing to the machine.

May 11 13:16:29 maas-trusty-back-may22 maas.node: [INFO] hayward-43: Status transition from DEPLOYING to FAILED_DEPLOYMENT
May 11 13:16:29 maas-trusty-back-may22 maas.node: [ERROR] hayward-43: Marking node failed: Installation failed (refer to the installation log for more information).

All have similar error in the log. You will need the installation log for this deployment to identify the error. I am going to mark incomplete until the installation log for one of the machines is attached to the bug. Only it will contain the actual reason for the failure. As the node was able to communicate back to MAAS, that curtin had failed to install.

Changed in maas:
status: New → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

I thought there was an existing bug for saving the install log as a file. Assuming that this was fixed in 1.8, what's the location of these logs?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

That's bug 1446916. Until that is fixed, we won't be able to collect logs of this type of failure.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.