instances stuck with task_state of REBOOTING after controller switchover

Bug #1296967 reported by Chris Friesen
This bug affects 4 people.

Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We were doing some testing of Havana and ran into a scenario that left two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING.
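The fire-and-forget pattern described in the steps above can be sketched as a minimal, self-contained simulation (hypothetical names, not actual nova code):

```python
# Minimal simulation of the reboot flow described above (hypothetical
# names -- not actual nova code). The API persists task_state=REBOOTING
# *before* casting the RPC; if the cast is dropped, nothing ever
# clears the state.

REBOOTING = "rebooting"

class Instance:
    def __init__(self):
        self.task_state = None

    def save(self):
        pass  # stand-in for the database update in step 3

def rpc_cast_reboot(instance, delivered):
    # An RPC cast is fire-and-forget: the caller gets no error back
    # even when the message is lost (step 5).
    if delivered:
        instance.task_state = None  # compute node finishes the reboot

def api_reboot(instance, delivered=True):
    instance.task_state = REBOOTING  # step 3
    instance.save()
    rpc_cast_reboot(instance, delivered)  # step 4: cast, no reply

inst = Instance()
api_reboot(inst, delivered=False)  # message dropped (step 5)
print(inst.task_state)  # stuck at "rebooting" (step 6)
```

The point of the sketch is that nothing in the cast path reports delivery failure back to the layer that set the state, so the stale task_state can only be fixed by something external.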

Currently, when doing a reboot, we set the REBOOTING task_state in the database in compute-api and then send an RPC cast. That seems awfully risky: if that message gets lost or the call fails for any reason, we could end up stuck in the REBOOTING state forever. I think it might make sense to have the power state audit clear the REBOOTING state if appropriate, but others with more experience should make that call.
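One possible shape for such an audit, purely a sketch: it assumes a periodic task that can see each instance's task_state and the time it was last updated (the helper name, dict layout, and timeout are all invented for illustration):

```python
# Sketch of a periodic audit that clears a stale REBOOTING task_state
# (hypothetical helper, not existing nova code). Assumes each instance
# record carries the timestamp of its last task_state change.
import time

REBOOTING = "rebooting"
STALE_AFTER = 600  # seconds; an assumed timeout, tune to taste

def clear_stale_reboots(instances, now=None):
    """Reset task_state for instances stuck in REBOOTING too long."""
    now = now if now is not None else time.time()
    cleared = []
    for inst in instances:
        if (inst["task_state"] == REBOOTING
                and now - inst["updated_at"] > STALE_AFTER):
            inst["task_state"] = None  # let the state reflect reality
            cleared.append(inst["uuid"])
    return cleared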

It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.

The timeline that I have looks like this. We had some buggy code that sent reboot requests for all the instances when the controller came up. The first two are in the controller logs below, and these are the ones that failed.

controller: (running everything but nova-compute)
nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are reboot requests for the other instances, and those succeeded.

Interestingly, we later see this:
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399

Presumably the 409 responses are because nova thinks that these instances are currently rebooting.

compute:
2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>

As you can see, the two requests that got "lost" were sent during the period after the first batch of AMQP connections, but before the second batch. I didn't log controller-side timestamps for the successful reboot requests but they were after the two failed ones.

Tags: compute
Chris Friesen (cbf123)
description: updated
Tracy Jones (tjones-i)
Changed in nova:
milestone: none → ongoing
tags: added: compute
Revision history for this message
melanie witt (melwitt) wrote :

To gather more information about the issue: have you tried resetting the state of the instances stuck in the rebooting state?

http://docs.openstack.org/admin-guide-cloud/content/reset-state.html

Changed in nova:
status: New → Incomplete
Revision history for this message
Chris Friesen (cbf123) wrote :

Well sure, "nova reset-state --active" will reset the state to active and clear the task_state, but that's an admin-level action.

It would be better to have it be corrected automatically the way we currently handle the power state.

Revision history for this message
melanie witt (melwitt) wrote :

Yes, I just wanted to check if they were resettable i.e. if they weren't, something is more wrong than the task state being inaccurate. I agree it should be corrected automatically.

I've seen similar behavior in a scenario where a network interruption during an instance delete request results in the instance getting stuck in the "deleting" state.

Changed in nova:
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
Eli Qiao (taget-9) wrote :

Can you check whether this patch fixes the issue?
https://review.openstack.org/#/c/123392/5

Changed in nova:
status: Confirmed → Incomplete
Eli Qiao (taget-9)
Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Revision history for this message
Eli Qiao (taget-9) wrote :

And FYI, Havana is EOL now: https://wiki.openstack.org/wiki/Releases

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Chris Friesen (cbf123):
Could you please check whether the patch Eli Qiao provided in comment #4 fixes your issue?

Eli Qiao (taget-9):
I removed you as assignee. If this was wrong because you are working on this, please add yourself as assignee again and change the status to "in progress".

Changed in nova:
assignee: Eli Qiao (taget-9) → nobody
Revision history for this message
Chris Friesen (cbf123) wrote :

It's a bit tricky to reproduce since it depends on a race condition. I'll try to carve out some time to analyze the code changes.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Chris Friesen:
This bug report was opened against Havana, and the confirmation from Melanie (comment #3) came during the Juno cycle. You also state that this is an issue which comes up in certain race conditions which are hard to reproduce. Given the age and conditions of this bug report, there is almost no chance to make progress here. I'm going to deprecate it with "Won't Fix". If the issue arises again, just reopen the report by setting it to "New".

Changed in nova:
status: Incomplete → Won't Fix
importance: High → Undecided
milestone: ongoing → none