[2.0 RC2] Multiple failures to release nodes

Bug #1603563 reported by Larry Michel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Lee Trager
2.0
Fix Released
High
Lee Trager
2.1
Fix Released
High
Lee Trager

Bug Description

I am seeing a number of servers that failed to release. These are automated builds not manual deployments. There were 4 failed nodes on one of the maas servers:

ubuntu@maas2-production:/var/log/maas$ grep " to FAILED_RELEA" maas.log.1
Jul 14 20:56:11 maas2-production maas.node: [INFO] tucker: Status transition from RELEASING to FAILED_RELEASING
Jul 15 01:56:50 maas2-production maas.node: [INFO] orangebox10: Status transition from RELEASING to FAILED_RELEASING
ubuntu@maas2-production:/var/log/maas$ grep " to FAILED_RELEA" maas.log
Jul 15 06:31:27 maas2-production maas.node: [INFO] orangebox9: Status transition from RELEASING to FAILED_RELEASING

After seeing these, I tried releasing the nodes manually through the UI and that worked.

ubuntu@maas2-production:/var/log/maas$ grep -i "failed_rele" maas.log
Jul 15 06:31:27 maas2-production maas.node: [INFO] orangebox9: Status transition from RELEASING to FAILED_RELEASING
Jul 15 18:22:55 maas2-production maas.node: [INFO] orangebox3: Status transition from FAILED_RELEASING to RELEASING
Jul 15 18:22:55 maas2-production maas.node: [INFO] orangebox9: Status transition from FAILED_RELEASING to RELEASING
Jul 15 18:22:55 maas2-production maas.node: [INFO] orangebox10: Status transition from FAILED_RELEASING to RELEASING
Jul 15 18:22:55 maas2-production maas.node: [INFO] orangebox6: Status transition from FAILED_RELEASING to RELEASING
ubuntu@maas2-production:/var/log/maas$

but then I saw another 4 failures on another maas server which is also on RC2. I left these in the failed state, but the node manager which polls the server states may try to release them.

ubuntu@maas2-integration:/var/log/maas$ grep -i "failed_rele" maas.log
Jul 15 11:48:27 maas2-integration maas.node: [INFO] prunus: Status transition from RELEASING to FAILED_RELEASING
Jul 15 11:48:27 maas2-integration maas.node: [INFO] kobusch: Status transition from RELEASING to FAILED_RELEASING
Jul 15 19:35:17 maas2-integration maas.node: [INFO] hayward-54: Status transition from RELEASING to FAILED_RELEASING
Jul 15 19:35:22 maas2-integration maas.node: [INFO] hayward-63: Status transition from RELEASING to FAILED_RELEASING

ubuntu@maas2-production:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==================================-============-=================================================
ii maas 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS server common files
ii maas-dhcp 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS DHCP server
ii maas-dns 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS DNS server
ii maas-proxy 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all Rack Controller for MAAS
ii maas-region-api 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0~rc2+bzr5156-0ubuntu1~xenial1 all MAAS server provisioning libraries (Python 3)

I am attaching the logs for both maas servers (logs1.tar.gz for first one and logs2.tar.gz for second server).

Tags: oil

Related branches

Revision history for this message
Larry Michel (lmic) wrote :
description: updated
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Larry,

Can you please provide node event logs as well?

Changed in maas:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, I see this in the logs:

provisioningserver.drivers.power.PowerAuthError: Incorrect password. Check BMC configuration and try again.

This may be causing the failure in releasing... are you sure you have the right credentials?

Revision history for this message
Larry Michel (lmic) wrote :

Hi Andres,
Yes, they have correct credentials. Retrying to release them manually worked. How do I get the node event logs without having to go through the GUI?

Revision history for this message
Larry Michel (lmic) wrote :
Download full text (9.8 KiB)

Trying to get the events log seems to be an iterative process because I can only get 100 entries at a time and with the Power state queries INFO entries there, there are literally thousands of entries. If I select level=ERROR, then there's not much returned and I still have to figure out So I just got them from the GUI

ubuntu@maas2-integration:~/lmic/events$ maas root events query id=4y3h8b
Success.
Machine-readable output follows:
{
    "events": [
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 12:10:37",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1463165
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 12:05:37",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1463080
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 12:00:22",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1463018
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 11:55:22",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1462955
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 11:50:08",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1462892
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 11:44:53",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1462830
        },
....
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 05:17:52",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1447596
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 05:12:37",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1447481
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
            "created": "Wed, 20 Jul. 2016 05:07:37",
            "level": "INFO",
            "type": "Queried node's BMC",
            "id": 1447378
        },
        {
            "hostname": "prunus",
            "description": "Power state queried: off",
            "node": "4y3h8b",
...

Read more...

Revision history for this message
Larry Michel (lmic) wrote :

event logs for 2 of the nodes that failed to release.

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Larry,

Ok, I see something strange here:

Node powered off Sun, 17 Jul. 2016 17:04:30
Powering node off Sun, 17 Jul. 2016 17:04:27
Node changed status - From 'Releasing' to 'Releasing failed' Sun, 17 Jul. 2016 17:04:26
User releasing node - (oil-slave-5) - Released by Juju MAAS provider Sun, 17 Jul. 2016 17:04:26

If you see, the node is told to release, but it immediately move to Releasing failed. But 1 second later the node is told to power off. So I see that the power action is done correctly, at least according to the node event log, but it seems the monitor transitioned to failed to release without waiting for the power action.

Can you attach /var/log/maas/rackd.log,regiond.log,maas.log of the eact moment this is happening please?

Changed in maas:
importance: Undecided → High
milestone: none → 2.0.0
Revision history for this message
Larry Michel (lmic) wrote :

Hi Andres,

Attached are the logs for that window.

Revision history for this message
Newell Jensen (newell-jensen) wrote :

Larry,

Are you sure you don't have some sort of network credential issues with some of your hardware? I see these errors in your logs.tar.gz --> logs1.tar.gz --> maas.log:

Jul 15 19:35:17 maas2-integration maas.node: [INFO] hayward-54: Status transition from RELEASING to FAILED_RELEASING
Jul 15 19:35:17 maas2-integration maas.node: [ERROR] hayward-54: Marking node failed: Node could not be powered off: Failed talking to node's BMC: Not Found: servers/54/0 (HTTP 404)
Jul 15 19:35:20 maas2-integration maas.node: [INFO] hayward-63: Releasing node
Jul 15 19:35:20 maas2-integration maas.node: [INFO] hayward-63: Status transition from DEPLOYED to RELEASING
Jul 15 19:35:21 maas2-integration maas.power: [INFO] Changing power state (off) of node: hayward-63 (4y3hfs)
Jul 15 19:35:22 maas2-integration maas.power: [ERROR] Error changing power state (off) of node: hayward-63 (4y3hfs)
Jul 15 19:35:22 maas2-integration maas.node: [INFO] hayward-63: Status transition from RELEASING to FAILED_RELEASING
Jul 15 19:35:22 maas2-integration maas.node: [ERROR] hayward-63: Marking node failed: Node could not be powered off: Failed talking to node's BMC: Not Found: servers/63/0 (HTTP 404)

Revision history for this message
Newell Jensen (newell-jensen) wrote :

After talking with Larry going to set to incomplete until this can be recreated.

Revision history for this message
Larry Michel (lmic) wrote :

We're still getting recreates. Here's the latest case (screenshot attached). I will also attach the logs from that time frame.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Strange. When the releasing is done, we try to release the machine but if it is not released after the monitoring period, we mark it as releasing failed.

What seems to be the case here is that:

1. juju tells maas to release
2. machine never powers off/release (we need to find out why
3. monitor marks it failed releasing.

Also, can you reproduce this? It may be that at the moment of the request the rack controller wasn't connected.

Revision history for this message
Larry Michel (lmic) wrote :
Download full text (3.2 KiB)

Yes it can be recreated Andres. For example, in the screeshot attached, there are 3 nodes that seem to be recreates:

Lambert:
Node powered off Sat, 20 Aug. 2016 11:17:00
Node changed status - From 'Releasing' to 'Releasing failed' Sat, 20 Aug. 2016 11:16:46
Powering node off Sat, 20 Aug. 2016 11:16:46
Marking node failed - Machine operation 'Releasing' timed out after 5 minutes. Sat, 20 Aug. 2016 11:16:46
User releasing node - (production-qmaster) Sat, 20 Aug. 2016 11:16:45
Queried node's BMC - Power state queried: on Sat, 20 Aug. 2016 10:11:55
Node changed status - From 'Deploying' to 'Deployed' Sat, 20 Aug. 2016 10:11:35
Installation complete - Node disabled netboot Sat, 20 Aug. 2016 10:09:06
PXE Request - installation Sat, 20 Aug. 2016 10:07:16
Node powered on

Drapion:
Node powered off Sun, 21 Aug. 2016 12:54:52
Node changed status - From 'Releasing' to 'Releasing failed' Sun, 21 Aug. 2016 12:54:46
Powering node off Sun, 21 Aug. 2016 12:54:46
Marking node failed - Machine operation 'Releasing' timed out after 5 minutes. Sun, 21 Aug. 2016 12:54:46
User releasing node - (production-qmaster) Sun, 21 Aug. 2016 12:54:45
Queried node's BMC - Power state queried: on Sun, 21 Aug. 2016 11:52:34
Node changed status - From 'Deploying' to 'Deployed' Sun, 21 Aug. 2016 11:52:13
PXE Request - local boot Sun, 21 Aug. 2016 11:51:49
PXE Request - local boot Sun, 21 Aug. 2016 11:51:48
PXE Request - local boot Sun, 21 Aug. 2016 11:51:48

Pullman-02:

Node powered off Sat, 20 Aug. 2016 08:18:48
Node changed status - From 'Releasing' to 'Releasing failed' Sat, 20 Aug. 2016 08:18:46
Marking node failed - Machine operation 'Releasing' timed out after 5 minutes. Sat, 20 Aug. 2016 08:18:46
Powering node off Sat, 20 Aug. 2016 08:18:46
User releasing node - (production-qmaster) Sat, 20 Aug. 2016 08:18:45
Queried node's BMC - Power state queried: on Sat, 20 Aug. 2016 07:08:04
Node changed status - From 'Deploying' to 'Deployed' Sat, 20 Aug. 2016 07:07:44
PXE Request - local boot Sat, 20 Aug. 2016 07:07:26
PXE Request - local boot Sat, 20 Aug. 2016 07:07:26
Installation complete - Node disabled netboot Sat, 20 Aug. 2016 07:05:54

The last one is actually due to a BMC issue so you can disregard that one. It's got a shared NIC, and there are connectivity issues associated with it.

reimarus:

 Failed to power off node - Node could not be powered off: Could not contact node's BMC: Connection timed out while performing power action. Check BMC configuration and connectivity and try again. Sat, 20 Aug. 2016 06:41:46
Node changed status - From 'Releasing' to 'Releasing failed' Sat, 20 Aug. 2016 06:41:46
Marking node failed - Node could not be powered off: Could not contact node's BMC: Connection timed out while performing power action. Check BMC configuration and connectivity and try again. Sat, 20 Aug. 2016 06:41:46
Powering node off Sat, 20 Aug. 2016 06:40:41
User releasing node - (production-qmaster) Sat, 20 Aug. 2016 06:40:41
Queried node's BMC - Power state queried: on Sat, 20 Aug. 2016 05:41:55
Node changed status - From 'Deploying' to 'Deployed' Sat, 20 Aug. 2016 05:41:33
PXE Request - local boot Sat, 20 Aug. 2016 05:41:10
Installation complete - Node disabled netbo...

Read more...

Revision history for this message
Larry Michel (lmic) wrote :

Attaching new logs for that time period:

Gavin Panella (allenap)
Changed in maas:
status: New → Triaged
Revision history for this message
David Britton (dpb) wrote :

Hit a similar issue. 'release-failed'. it said that it waited 5 mintues, but the log indicates it was 1 second.

Machine to search for is 'bierstadt.maas'

And release failed around the 20/21 of oct 2016

Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :

logs including the .1 and gziped ones too.

Revision history for this message
David Britton (dpb) wrote :

screenshot showing the odd timestamps. notice 1 second when 5 minutes should have been used.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.