Activity log for bug #1296967

Date Who What changed Old value New value Message
2014-03-24 21:45:56 Chris Friesen bug added bug
2014-03-25 19:39:12 Chris Friesen description

Old value:

We were doing some testing of Havana and have run into a scenario that ended up with two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING.

Currently when doing a reboot we set the REBOOTING task_state in the database in compute-api and then send an RPC cast. That seems awfully risky given that if that message gets lost or the call fails for any reason we could end up stuck in the REBOOTING state forever. I think it might make sense to have the power state audit clear the REBOOTING state if appropriate, but others with more experience should make that call.

It didn't happen to use, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.

The timeline that I have looks like this. We had some buggy code that sent all the instances for a reboot when the controller came up. The first two are in the controller logs below, and these are the ones that failed.

controller: (running everything but nova-compute)

nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are other reboot requests for the other instances, and those ones passed.
Interestingly, we later see this:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399

Presumably the 409 responses are because nova thinks that these instances are currently rebooting.

compute:

2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>

As you can see, the two requests that got "lost" were sent during the period after the first batch of AMQP connections, but before the second batch. I didn't log controller-side timestamps for the successful reboot requests but they were after the two failed ones.

New value:

We were doing some testing of Havana and have run into a scenario that ended up with two instances stuck with a task_state of REBOOTING following a reboot of the controller:

1) We reboot the controller.
2) Right after it comes back up something calls compute.api.API.reboot() on an instance.
3) That sets instance.task_state = task_states.REBOOTING and then calls instance.save() to update the database.
4) Then it calls self.compute_rpcapi.reboot_instance() which does an rpc cast.
5) That message gets dropped on the floor due to communication issues between the controller and the compute.
6) Now we're stuck with a task_state of REBOOTING.

Currently when doing a reboot we set the REBOOTING task_state in the database in compute-api and then send an RPC cast. That seems awfully risky given that if that message gets lost or the call fails for any reason we could end up stuck in the REBOOTING state forever. I think it might make sense to have the power state audit clear the REBOOTING state if appropriate, but others with more experience should make that call.

It didn't happen to us, but I think we could get into this state another way:

1) nova-compute was running reboot_instance()
2) we reboot the controller
3) reboot_instance() times out trying to update the instance with the the new power state and a task_state of None.
4) Later on in _sync_power_states() we would update the power_state, but nothing would update the task_state.

The timeline that I have looks like this. We had some buggy code that sent all the instances for a reboot when the controller came up. The first two are in the controller logs below, and these are the ones that failed.
controller: (running everything but nova-compute)

nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are other reboot requests for the other instances, and those ones passed.

Interestingly, we later see this:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 192.168.204.195 "POST /v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action HTTP/1.1" status: 409 len: 303 time: 0.0366399

Presumably the 409 responses are because nova thinks that these instances are currently rebooting.
compute:

2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver [req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver [req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>

As you can see, the two requests that got "lost" were sent during the period after the first batch of AMQP connections, but before the second batch. I didn't log controller-side timestamps for the successful reboot requests but they were after the two failed ones.
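
[Editor's illustration] The race described in both versions of the description above comes down to a database write followed by a fire-and-forget RPC cast, with nothing tying the two together. The following is a minimal, hypothetical Python sketch of that ordering, assuming an instance object with task_state and save() and an RPC client exposing a reboot_instance() cast; it is not the actual Nova Havana code, only a restatement of steps 3-5 from the report.

    REBOOTING = 'rebooting'

    class ComputeAPI(object):
        """Hypothetical stand-in for compute.api.API, for illustration only."""

        def __init__(self, compute_rpcapi):
            # compute_rpcapi sends one-way casts to the nova-compute service.
            self.compute_rpcapi = compute_rpcapi

        def reboot(self, context, instance, reboot_type='SOFT'):
            # Step 3: the REBOOTING task_state is committed to the database first.
            instance.task_state = REBOOTING
            instance.save()

            # Step 4: fire-and-forget cast.  If the AMQP message is dropped
            # (step 5), nothing ever runs on the compute node to reset
            # task_state, so the instance stays in REBOOTING (step 6).
            self.compute_rpcapi.reboot_instance(context, instance, reboot_type)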
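
[Editor's illustration] The mitigation suggested in the description is to let the periodic power-state audit clear a REBOOTING task_state that can no longer correspond to an in-flight reboot. A possible shape for such a check is sketched below; the helper name, the grace period, and the fields it consults are assumptions made for illustration, not existing Nova behaviour.

    from datetime import datetime, timedelta, timezone

    REBOOTING = 'rebooting'
    RUNNING = 'running'                   # illustrative power-state value
    STALE_AFTER = timedelta(minutes=10)   # assumed grace period

    def clear_stale_rebooting(instance, observed_power_state, now=None):
        """Hypothetical audit helper: if an instance has sat in REBOOTING
        past the grace period and the hypervisor reports it running, assume
        the reboot request was lost and clear the stale task_state."""
        now = now or datetime.now(timezone.utc)
        if instance.task_state != REBOOTING:
            return False
        if observed_power_state != RUNNING:
            return False
        if now - instance.updated_at < STALE_AFTER:
            return False
        instance.task_state = None
        instance.power_state = observed_power_state
        instance.save()
        return True

In the report's terms, a check of this shape would also cover the second failure mode, where _sync_power_states() updates the power_state but nothing updates the task_state.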
2014-03-26 16:44:15 Tracy Jones nova: milestone ongoing
2014-03-26 16:44:35 Tracy Jones tags compute
2014-03-27 15:06:09 Ian Jolliffe bug added subscriber Ian Jolliffe
2014-04-03 20:52:43 melanie witt nova: status New Incomplete
2014-04-03 22:23:22 melanie witt nova: importance Undecided High
2014-04-03 22:23:22 melanie witt nova: status Incomplete Confirmed
2014-10-10 06:02:41 Kentaro Takeda bug added subscriber Kentaro Takeda
2015-04-02 08:45:21 Eli Qiao nova: status Confirmed Incomplete
2015-04-02 10:35:28 Eli Qiao nova: assignee Eli Qiao (taget-9)
2015-06-08 12:23:33 Markus Zoeller (markus_z) nova: assignee Eli Qiao (taget-9)
2015-07-07 12:24:00 Yashpal bug added subscriber Yashpal
2016-03-22 18:32:56 Markus Zoeller (markus_z) nova: status Incomplete Won't Fix
2016-03-22 18:33:01 Markus Zoeller (markus_z) nova: importance High Undecided
2016-03-22 18:33:04 Markus Zoeller (markus_z) nova: milestone ongoing
2016-09-02 17:53:16 Gustavo Randich bug added subscriber Gustavo Randich