instances stuck with task_state of REBOOTING after controller switchover

Bug #1296967 reported by Chris Friesen
This bug affects 4 people.

Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We were doing some testing of Havana and ran into a scenario that left two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING.
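The fire-and-forget pattern described in the steps above can be sketched as a minimal, self-contained simulation (hypothetical names, not actual nova code):

```python
# Minimal simulation of the reboot flow described above (hypothetical
# names -- not actual nova code). The API persists task_state=REBOOTING
# *before* casting the RPC; if the cast is dropped, nothing ever
# clears the state.

REBOOTING = "rebooting"

class Instance:
    def __init__(self):
        self.task_state = None

    def save(self):
        pass  # stand-in for the database update in step 3

def rpc_cast_reboot(instance, delivered):
    # An RPC cast is fire-and-forget: the caller gets no error back
    # even when the message is lost (step 5).
    if delivered:
        instance.task_state = None  # compute node finishes the reboot

def api_reboot(instance, delivered=True):
    instance.task_state = REBOOTING  # step 3
    instance.save()
    rpc_cast_reboot(instance, delivered)  # step 4: cast, no reply

inst = Instance()
api_reboot(inst, delivered=False)  # message dropped (step 5)
print(inst.task_state)  # stuck at "rebooting" (step 6)
```

The point of the sketch is that nothing in the cast path reports delivery failure back to the layer that set the state, so the stale task_state can only be fixed by something external.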

Currently, when doing a reboot, we set the REBOOTING task_state in the database in compute-api and then send an RPC cast. That seems awfully risky: if that message gets lost or the call fails for any reason, we could end up stuck in the REBOOTING state forever. I think it might make sense to have the power state audit clear the REBOOTING state if appropriate, but others with more experience should make that call.
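One possible shape for such an audit, purely a sketch: it assumes a periodic task that can see each instance's task_state and the time it was last updated (the helper name, dict layout, and timeout are all invented for illustration):

```python
# Sketch of a periodic audit that clears a stale REBOOTING task_state
# (hypothetical helper, not existing nova code). Assumes each instance
# record carries the timestamp of its last task_state change.
import time

REBOOTING = "rebooting"
STALE_AFTER = 600  # seconds; an assumed timeout, tune to taste

def clear_stale_reboots(instances, now=None):
    """Reset task_state for instances stuck in REBOOTING too long."""
    now = now if now is not None else time.time()
    cleared = []
    for inst in instances:
        if (inst["task_state"] == REBOOTING
                and now - inst["updated_at"] > STALE_AFTER):
            inst["task_state"] = None  # let the state reflect reality
            cleared.append(inst["uuid"])
    return cleared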

It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.

The timeline that I have looks like this. We had some buggy code that sent reboot requests for all the instances when the controller came up. The first two are in the controller logs below, and these are the ones that failed.

controller: (running everything but nova-compute)
nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are reboot requests for the other instances, and those succeeded.

Interestingly, we later see this:
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399

Presumably the 409 responses are because nova thinks that these instances are currently rebooting.

compute:
2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>

As you can see, the two requests that got "lost" were sent during the period after the first batch of AMQP connections, but before the second batch. I didn't log controller-side timestamps for the successful reboot requests but they were after the two failed ones.

Tags: compute
Chris Friesen (cbf123)
description: updated
Tracy Jones (tjones-i)
Changed in nova:
milestone: none → ongoing
tags: added: compute
Revision history for this message
melanie witt (melwitt) wrote :

To gather more information about the issue: have you tried resetting the state of the instances stuck in the rebooting state?

http://docs.openstack.org/admin-guide-cloud/content/reset-state.html

Changed in nova:
status: New → Incomplete
Revision history for this message
Chris Friesen (cbf123) wrote :

Well sure, "nova reset-state --active" will reset the state to active and clear the task_state, but that's an admin-level action.

It would be better to have it be corrected automatically the way we currently handle the power state.

Revision history for this message
melanie witt (melwitt) wrote :

Yes, I just wanted to check if they were resettable i.e. if they weren't, something is more wrong than the task state being inaccurate. I agree it should be corrected automatically.

I've seen similar behavior in a scenario where a network interruption during an instance delete request results in the instance getting stuck in the "deleting" state.

Changed in nova:
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
Eli Qiao (taget-9) wrote :

Can you check whether this patch fixes the issue?
https://review.openstack.org/#/c/123392/5

Changed in nova:
status: Confirmed → Incomplete
Eli Qiao (taget-9)
Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Revision history for this message
Eli Qiao (taget-9) wrote :

And FYI, Havana is EOL now: https://wiki.openstack.org/wiki/Releases

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Chris Friesen (cbf123):
Could you please check whether the patch Eli Qiao provided in comment #4 fixes your issue?

Eli Qiao (taget-9):
I removed you as assignee. If this was wrong because you are working on this, please add yourself as assignee again and change the status to "in progress".

Changed in nova:
assignee: Eli Qiao (taget-9) → nobody
Revision history for this message
Chris Friesen (cbf123) wrote :

It's a bit tricky to reproduce since it depends on a race condition. I'll try to carve out some time to analyze the code changes.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Chris Friesen:
This bug report was opened against Havana, and the confirmation from Melanie (comment #3) came during the Juno cycle. You also state that this is an issue which comes up in certain race conditions which are hard to reproduce. Given the age and conditions of this bug report, there is almost no chance to make progress here. I'm going to deprecate it with "Won't Fix". If the issue arises again, just reopen the report by setting it to "New".

Changed in nova:
status: Incomplete → Won't Fix
importance: High → Undecided
milestone: ongoing → none