Node powered off after 'PXE request - power off' during deployment, does not get marked as failed and goes from Deploying to Allocated state

Bug #1427573 reported by Larry Michel
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Unassigned

Bug Description

Seeing this sequence of events where a node starts to deploy:
- Goes through transitions: Ready-> Allocated -> Deploying states
- PXE boots but no curtin install - PXE power off request instead
- is powered down - and remains powered off for the duration of the juju deployment.
- Sometimes later when juju environment is teared down, it actually transitions from Allocated to Ready. The node did not get to deployed nor failed state and somehow transitioned from Deploying to Allocated state (transition to Allocated not shown in the Event logs nor in the maas logs)

We have seen this happen a number of times. Here's one example:

From Events log:

INFO Sun, 01 March 2015 04:45:23 Node changed status — From 'Allocated' to 'Ready'
INFO Sun, 01 March 2015 04:45:23 Powering node off
INFO Sun, 01 March 2015 04:42:21 Node powered off
INFO Sun, 01 March 2015 04:42:20 Powering node off
DEBUG Sun, 01 March 2015 03:28:34 TFTP Request — ubuntu/amd64/generic/trusty/daily/boot-initrd
INFO Sun, 01 March 2015 03:28:33 PXE Request — power off
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — ubuntu/amd64/generic/trusty/daily/boot-kernel
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — ifcpu64.c32
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.cfg/01-9c-b6-54-0c-9b-2c
INFO Sun, 01 March 2015 03:28:33 PXE Request — power off
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.cfg/01-9c-b6-54-0c-9b-2c
DEBUG Sun, 01 March 2015 03:28:33 TFTP Request — pxelinux.cfg/36363436-3434-5553-4534-313356485250
DEBUG Sun, 01 March 2015 03:28:32 TFTP Request — pxelinux.0
DEBUG Sun, 01 March 2015 03:28:32 TFTP Request — pxelinux.0
INFO Sun, 01 March 2015 03:25:41 Node powered on
INFO Sun, 01 March 2015 03:25:33 Powering node on
INFO Sun, 01 March 2015 03:25:16 Node changed status — From 'Allocated' to 'Deploying'
INFO Sun, 01 March 2015 03:25:14 Node changed status — From 'Ready' to 'Allocated' (to oil-slave-5)

From Maas logs:
grep "Mar 1 0[3|4]" /var/log/maas/maas.log|grep euclid
...
Mar 1 03:25:14 maas-trusty-back-may22 maas.node: [INFO] euclid.local allocated to user oil-slave-5
Mar 1 03:25:32 maas-trusty-back-may22 maas.node: [INFO] euclid.local: Starting monitor: {u'deadline': datetime.datetime(2015, 3, 1, 4, 5, 32, 867854, tzinfo=<twisted.protocols.amp._FixedOffsetTZInfo object at 0x7f5e172e4d90>), u'id': u'node-7918406a-3dfa-11e4-b38a-00163eca07b6', u'context': {u'node_status': 9, u'timeout': 2400.0}}
Mar 1 03:25:33 maas-trusty-back-may22 maas.power: [INFO] Changing power state (on) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
Mar 1 03:25:41 maas-trusty-back-may22 maas.power: [INFO] Changed power state (on) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
Mar 1 03:30:18 maas-trusty-back-may22 maas.power: [INFO] euclid.local: Power state has changed from on to off.
Mar 1 04:42:20 maas-trusty-back-may22 maas.power: [INFO] Changing power state (off) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
Mar 1 04:42:21 maas-trusty-back-may22 maas.power: [INFO] Changed power state (off) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
Mar 1 04:45:23 maas-trusty-back-may22 maas.node: [INFO] euclid.local: Releasing node
Mar 1 04:45:23 maas-trusty-back-may22 maas.power: [INFO] Changing power state (off) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
Mar 1 04:45:23 maas-trusty-back-may22 maas.node: [INFO] euclid.local: Deleting DHCP host maps
Mar 1 04:45:25 maas-trusty-back-may22 maas.power: [INFO] Changed power state (off) of node: euclid.local (node-7918406a-3dfa-11e4-b38a-00163eca07b6)
...

From pserv.log:

ubuntu@maas-trusty-back-may22:~/logs$ grep "2015-03-01 0[3|4]:[1|2|3|4|5]" ~/logs/pserv.log.1|grep "10.245.0.199"
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2071): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1456'})>
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2071): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1456'})>
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49152): <RRQDatagram(filename=pxelinux.cfg/36363436-3434-5553-4534-313356485250, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:32+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49152): <RRQDatagram(filename=pxelinux.cfg/36363436-3434-5553-4534-313356485250, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(filename=pxelinux.cfg/01-9c-b6-54-0c-9b-2c, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(filename=pxelinux.cfg/01-9c-b6-54-0c-9b-2c, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(filename=pxelinux.cfg/01-9c-b6-54-0c-9b-2c, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49153): <RRQDatagram(filename=pxelinux.cfg/01-9c-b6-54-0c-9b-2c, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49154): <RRQDatagram(filename=ifcpu64.c32, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49154): <RRQDatagram(filename=ifcpu64.c32, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49155): <RRQDatagram(filename=ubuntu/amd64/generic/trusty/daily/boot-kernel, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:33+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49155): <RRQDatagram(filename=ubuntu/amd64/generic/trusty/daily/boot-kernel, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:34+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49156): <RRQDatagram(filename=ubuntu/amd64/generic/trusty/daily/boot-initrd, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
2015-03-01 03:28:34+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 49156): <RRQDatagram(filename=ubuntu/amd64/generic/trusty/daily/boot-initrd, mode=octet, options={'tsize': '0', 'blksize': '1408'})>
<Larry COMMENT>This next entry is for different build<Larry ENDCOMMENT> 2015-03-01 04:51:37+0000 [TFTP (UDP)] Datagram received from ('10.245.0.199', 2070): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'tsize': '0'})>
...

From console.txt:

2015-03-01 03:26:43,808 [DEBUG] deployer.import: Adding units...
2015-03-01 03:26:45,098 [WARNING] deployer.import: Config specifies num units for subordinate: ceilometer-agent
2015-03-01 03:26:45,099 [DEBUG] deployer.import: Service 'ceph' does not need any more units added.
2015-03-01 03:26:45,100 [DEBUG] deployer.import: Service 'cinder' does not need any more units added.
2015-03-01 03:26:45,102 [DEBUG] deployer.import: Service 'nova-cloud-controller' does not need any more units added.
2015-03-01 03:26:45,103 [DEBUG] deployer.import: Service 'ceilometer' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'glance' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'heat' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'keystone' does not need any more units added.
2015-03-01 03:26:45,104 [DEBUG] deployer.import: Service 'mongodb' does not need any more units added.
2015-03-01 03:26:45,105 [DEBUG] deployer.import: Service 'mysql' does not need any more units added.
2015-03-01 03:26:45,107 [INFO] deployer.import: Adding 2 more units to nova-compute
2015-03-01 03:26:55,638 [DEBUG] deployer.import: Service 'openstack-dashboard' does not need any more units added.
2015-03-01 03:26:55,639 [DEBUG] deployer.import: Service 'rabbitmq-server' does not need any more units added.
2015-03-01 03:26:55,639 [DEBUG] deployer.import: Waiting for units before adding relations
2015-03-01 03:26:59,325 [DEBUG] deployer.env: Delta unit: nova-compute/2 change:pending
2015-03-01 03:26:59,408 [DEBUG] deployer.env: Delta machine: 3 change:pending
2015-03-01 03:27:29,332 [DEBUG] deployer.env: Delta machine: 4 change:pending
2015-03-01 03:27:29,427 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:28:24,355 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:29:14,360 [DEBUG] deployer.env: Delta machine: 4 change:pending
2015-03-01 03:29:24,362 [DEBUG] deployer.env: Delta machine: 5 change:pending
2015-03-01 03:32:29,382 [DEBUG] deployer.env: Delta machine: 1 change:pending
2015-03-01 03:32:34,384 [DEBUG] deployer.env: Delta machine: 1 change:pending
2015-03-01 03:32:34,466 [DEBUG] deployer.env: Delta machine: 1 change:started
2015-03-01 03:32:34,466 [DEBUG] deployer.env: Delta unit: ceph/0 change:pending
...
2015-03-01 04:41:56,117 [ERROR] deployer.import: Reached deployment timeout.. exiting

From juju_status.yaml:

  '2':
    agent-state: pending
    containers:
      2/lxc/0:
        instance-id: pending
        series: trusty
      2/lxc/1:
        instance-id: pending
        series: trusty
    dns-name: euclid.oil
    hardware: arch=amd64 cpu-cores=32 mem=49152M tags=hw-ok,hardware-hp-proliant-SL4540-G8,HP-SL
    instance-id: /MAAS/api/1.0/nodes/node-7918406a-3dfa-11e4-b38a-00163eca07b6/
    series: trusty

Something to note that may or may not be of relevance: juju-deployer does not show log for machine 2 as change:pending in the console.txt but juju_status.yaml does show it as pending.

Larry Michel (lmic)
summary: - Node powered off after PXE power off request during deployment, is not
- marked as failed and goes from Deploying to Allocated state
+ Node powered off after 'PXE request - power off' during deployment, does
+ not get marked as failed and goes from Deploying to Allocated state
description: updated
Revision history for this message
Larry Michel (lmic) wrote :

I see another system with same issue where the change:pending log does show in the console.txt so that last observation is likely a moot point.

Revision history for this message
Raphaël Badin (rvb) wrote :

> The node did not get to deployed nor failed state and somehow transitioned from Deploying to Allocated state (transition to
> Allocated not shown in the Event logs nor in the maas logs)

The fact that the node transitioned from Deploying to Allocated is the crux of the matter. Since an Allocated node is not supposed to boots up, MAAS will issue a 'power off' PXE template if it does (i.e. if the node boots in this state, it will get a tiny PXE config that will immediately power it off). This is what you're seeing here.

In other words:
- Goes through transitions: Ready-> Allocated -> Deploying states
- Node goes from Deploying to Allocated (*we need to identify what caused this*)
- PXE boots but no curtin install - PXE power off request instead

Can you provide the apache log (I assume this is with 1.7) at the time this happened (to see if Juju —or any other client— changed the status of the node)?

Changed in maas:
importance: Undecided → Critical
status: New → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

Attached are the apache logs. Also, see bellow what appears to be latest recreate from node's event log:

INFO Mon, 09 March 2015 00:48:48 Powering node off
DEBUG Sun, 08 March 2015 23:19:47 TFTP Request — ubuntu/amd64/generic/trusty/daily/boot-initrd
INFO Sun, 08 March 2015 23:19:45 PXE Request — power off
DEBUG Sun, 08 March 2015 23:19:45 TFTP Request — ubuntu/amd64/generic/trusty/daily/boot-kernel
DEBUG Sun, 08 March 2015 23:19:45 TFTP Request — ifcpu64.c32
DEBUG Sun, 08 March 2015 23:19:45 TFTP Request — pxelinux.cfg/01-00-22-99-e0-00-77
INFO Sun, 08 March 2015 23:19:45 PXE Request — power off
DEBUG Sun, 08 March 2015 23:19:44 TFTP Request — pxelinux.cfg/01-00-22-99-e0-00-77
INFO Sun, 08 March 2015 23:19:44 PXE Request — power off
DEBUG Sun, 08 March 2015 23:19:44 TFTP Request — pxelinux.cfg/01-00-22-99-e0-00-77
DEBUG Sun, 08 March 2015 23:19:44 TFTP Request — pxelinux.cfg/d1650e4f-b658-6d3f-b03e-f95b87f65f3d
DEBUG Sun, 08 March 2015 23:19:44 TFTP Request — pxelinux.0
INFO Sun, 08 March 2015 23:19:17 Node powered on
INFO Sun, 08 March 2015 23:19:14 Powering node on
INFO Sun, 08 March 2015 23:18:53 Node changed status — From 'Allocated' to 'Deploying'
INFO Sun, 08 March 2015 23:18:45 Node changed status — From 'Ready' to 'Allocated' (to oil-slave-6)

Changed in maas:
status: Incomplete → New
Changed in maas:
milestone: none → 1.8.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We'll wait and see if this shows up in 1.8.

Changed in maas:
assignee: nobody → Ricardo Bánffy (rbanffy)
assignee: Ricardo Bánffy (rbanffy) → nobody
Revision history for this message
Larry Michel (lmic) wrote :
Download full text (11.0 KiB)

Recreating this with 1.8b3.

What we are seeing from juju_status.yaml:

 '1':
    agent-state-info: "gomaasapi: got error back from server: 500 INTERNAL SERVER\
      \ ERROR (<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"\
      http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\"\
      \ xml:lang=\"en\" lang=\"en\">\n <head>\n <title>Error: Internal server\
      \ error</title>\n </head>\n <body>\n <h2>\n Internal server error.\n\
      \ </h2>\n </body>\n</html>\n)"
    containers:
      1/lxc/0:
        instance-id: pending
        series: trusty
    instance-id: pending
    series: trusty

From events.log:

 Node changed status - From 'Deploying' to 'Ready' Sun, 19 Apr. 2015 10:48:10
Node powered off Sun, 19 Apr. 2015 10:45:30
Powering node off Sun, 19 Apr. 2015 10:45:05
Node powered off Sun, 19 Apr. 2015 09:25:44
Node changed status - From 'Allocated' to 'Deploying' Sun, 19 Apr. 2015 09:25:37
Node changed status - From 'Ready' to 'Allocated' (to oil) Sun, 19 Apr. 2015 09:25:37

From regiond.log:

2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "GET /MAAS/api/1.0/nodes/node-a38685ec-c4cd-11e3-8102-00163efc5068/?op=details HTTP/1.1" 200 6598 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:36 +0000] "POST /MAAS/api/1.0/nodes/node-a38685ec-c4cd-11e3-8102-00163efc5068/?op=release HTTP/1.1" 200 801 "-" "Python-httplib2/0.7.2 (gzip)"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:36 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:36 +0000] "GET /MAAS/api/1.0/networks/?node=node-a38685ec-c4cd-11e3-8102-00163efc5068 HTTP/1.1" 200 221 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "GET /MAAS/api/1.0/nodes/?nodes=node-9d30e4e4-c4cd-11e3-8102-00163efc5068&op=deployment_status HTTP/1.1" 200 64 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:36 +0000] "GET /MAAS/api/1.0/networks/oil-mgmt/?op=list_connected_macs HTTP/1.1" 200 3219 "-" "Go 1.1 package http"
2015-04-19 09:25:37 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:38 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:38 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:38 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 409 62 "-" "Go 1.1 package http"
2015-04-19 09:25:38 [-] 127.0.0.1 - - [19/Apr/2015:09:25:37 +0000] "POST /MAAS/api/1.0/nodes/node-a35076c8-c4cd-11e3-8102-00163efc5068/?op=release HTTP/1.1" 200 795 "-" "Python-httplib2/0.7.2 (gzip)"
2015-04-19 09:25:38 [-] 127.0.0.1 - ...

Revision history for this message
Mike Pontillo (mpontillo) wrote :

This may be a side comment, but...

From the log, a node is being released, and subsequently re-acquired. But it cannot be immediately re-acquired, since it is still releasing. In parallel, something is attempting to start the node (which is an error) and causes a 500 to be returned with provisioningserver.rpc.exceptions.PowerActionAlreadyInProgress.

If it was the power action failure alone, we should return a 409 (since it can be retried). In this case, it seems like we should catch the attempt to start a "Releasing" node sooner, and throw a more appropriate error.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Jason / Larry,

Can you confirm this is the behavior in OIL? Or the usage of MAAS by OIL?

Changed in maas:
status: New → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

In OIL, the scenario for this to happen would be at the start of a build where a release is performed on all the nodes as a sanity check. Ideally, the nodes would already be in the ready state by being properly released by the previous build; so the release should fail which is OK in this case.

When starting the bootstrap node, OIL will attempt to deploy the bootstrap node and if some reason that node was stuck in releasing state then the scenario described by Mike would happen.

Gavin Panella (allenap)
Changed in maas:
status: Incomplete → Triaged
Changed in maas:
importance: Critical → High
Changed in maas:
importance: High → Critical
Revision history for this message
Andres Rodriguez (andreserl) wrote :

We believe this is no longer an issue in the latest releases of MAAS. Please upgrade to the latest version of MAAS, and If you believe this issue is still present, please re-open this bug report or file a new one.

Changed in maas:
status: Triaged → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.