RUG cannot recover powered down router instances

Bug #1468045 reported by Adam Gandelman
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
Astara    Fix Released   Medium       Adam Gandelman    7.0.0
akanda    Fix Released   Medium       Adam Gandelman    7.0.0

Bug Description

If the RUG starts up but one of its managed VMs is powered off (but still exists), the RUG cannot recover. The expected behavior appears to be that the appliance fails its liveness check and the RUG deletes it and boots a new one, but this instead fails with:

2015-06-23 11:00:46.499 INFO akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] Booting router
2015-06-23 11:00:46.659 ERROR akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] Router failed to start boot
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 Traceback (most recent call last):
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 File "/opt/stack/akanda-rug/akanda/rug/vm_manager.py", line 217, in boot
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 make_vrrp_ports
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 File "/opt/stack/akanda-rug/akanda/rug/api/nova.py", line 168, in boot_instance
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 self.client.servers.delete(instance_info.id_)
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 AttributeError: 'NoneType' object has no attribute 'id_'
2015-06-23 11:00:46.659 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125
2015-06-23 11:00:46.660 DEBUG akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] CreateVM attempt 1/2 from (pid=9645) execute /opt/stack/akanda-rug/akanda/rug/state.py:232
2015-06-23 11:00:46.660 DEBUG akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] CreateVM.execute -> poll vm.state=down from (pid=9645) update /opt/stack/akanda-rug/akanda/rug/state.py:433
2015-06-23 11:00:46.660 DEBUG akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] CreateVM.transition(poll) -> CreateVM vm.state=down from (pid=9645) update /opt/stack/akanda-rug/akanda/rug/state.py:448
2015-06-23 11:00:46.660 DEBUG akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] CreateVM.execute(poll) vm.state=down from (pid=9645) update /opt/stack/akanda-rug/akanda/rug/state.py:427
2015-06-23 11:00:46.661 DEBUG oslo_messaging._drivers.amqpdriver:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] MSG_ID is a2d03520f91f4d31898480be9bc3caba from (pid=9645) _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:322
2015-06-23 11:00:46.776 INFO akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] Booting router
2015-06-23 11:00:46.825 ERROR akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384b2-0ae7-4bb3-a4d4-49327641515c None None] Router failed to start boot
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 Traceback (most recent call last):
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 File "/opt/stack/akanda-rug/akanda/rug/vm_manager.py", line 217, in boot
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 make_vrrp_ports
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 File "/opt/stack/akanda-rug/akanda/rug/api/nova.py", line 168, in boot_instance
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 self.client.servers.delete(instance_info.id_)
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125 AttributeError: 'NoneType' object has no attribute 'id_'
2015-06-23 11:00:46.825 TRACE akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125
2015-06-23 11:00:46.825 DEBUG akanda.rug.state.cb2afe8d-b797-4da8-b892-6a4d1ac38125:9645:p00:t00 [req-ba4384

I hit this while attempting to re-stack a rebooted devstack VM, but the same issue would exist after a datacenter reboot. I wonder if it makes sense to short-circuit the recreate and simply restart existing but powered-off instances.
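
A minimal sketch of that suggestion using python-novaclient follows. The helper name recover_or_reboot, and the idea of passing in the already-looked-up Nova server (rather than the RUG's cached instance_info, which may be empty after a restart), are assumptions for illustration, not anything in the RUG today.

# Hypothetical sketch of the reporter's suggestion: if the appliance VM
# still exists but is powered off, power it back on instead of deleting
# it and booting a replacement. 'nova' is a python-novaclient Client;
# 'server' is the existing Nova instance found for the router (looked up
# by name, since the RUG's cached instance_info may be empty after a
# RUG restart).
def recover_or_reboot(nova, server):
    if server.status == 'SHUTOFF':
        # Short-circuit the delete/recreate cycle: just start the
        # existing appliance again.
        nova.servers.start(server)
        return server
    # Otherwise fall back to the current delete-and-recreate behavior.
    nova.servers.delete(server)
    return None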

Changed in akanda:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix proposed to akanda-rug (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/205272

OpenStack Infra (hudson-openstack) wrote: Fix merged to akanda-rug (master)

Reviewed: https://review.openstack.org/194841
Committed: https://git.openstack.org/cgit/stackforge/akanda-rug/commit/?id=c118c2ff21368e05931841bc127aa52b3288ac80
Submitter: Jenkins
Branch: master

commit c118c2ff21368e05931841bc127aa52b3288ac80
Author: Adam Gandelman <email address hidden>
Date: Tue Jun 23 14:57:15 2015 -0700

    Fix ability to recover from an existing appliance VM

    This fixes two issues discovered when trying to reboot a devstack
    host, which is similar to a full datacenter reboot.

    First, the RUG currently tries to delete any existing non-alive VMs
    it finds for a router. It is failing to do that at the moment
    because it attempts to use an ID from the wrong object.

    With that fixed, the VM manager and state machine are still
    incapable of removing the old appliance VM and starting with a
    fresh one. Currently, the pass through the state machine that
    detects and deletes the existing VM counts as a failed boot
    attempt, and the execution does not clear its cached instance_info
    after it has deleted the instance. This causes another pass through
    the state machine with a stale instance_info that again counts
    against the boot count. As a result, the router gets set to ERROR
    and future POLL events get dropped, so no replacement router is
    ever booted.

    Change-Id: I7d1a0a58886088cf279a68b5aaf4cff2a678e16a
    Closes-bug: #1468045
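
To make the two fixes concrete, a rough sketch is below. The class and method names (NovaAPI, get_instance, after_stale_instance_deleted, attempt_counter, router.name) are illustrative assumptions, not the actual patch; the linked review holds the real change.

# Rough sketch of the two fixes described above. Names are illustrative
# assumptions (the real code lives in akanda/rug/api/nova.py and
# akanda/rug/vm_manager.py); see the linked review for the actual patch.

class NovaAPI(object):
    def __init__(self, client):
        self.client = client

    def get_instance(self, router):
        # Hypothetical lookup helper: find any existing Nova server for
        # this router (e.g. by a name derived from the router).
        servers = self.client.servers.list(search_opts={'name': router.name})
        return servers[0] if servers else None

    def boot_instance(self, router, instance_info, make_vrrp_ports):
        existing = self.get_instance(router)
        if existing:
            # Fix 1: delete using the ID of the server Nova actually
            # returned, not the cached instance_info, which is None when
            # the RUG has just started up.
            self.client.servers.delete(existing.id)
            return None
        # ...normal boot path continues: make_vrrp_ports(), then
        # self.client.servers.create(...)...


class VmManager(object):
    def __init__(self):
        self.instance_info = None
        self.attempt_counter = 0

    def after_stale_instance_deleted(self):
        # Fix 2: clear the cached instance_info so the next pass through
        # the state machine boots a fresh appliance instead of retrying
        # with stale data, and avoid counting the cleanup pass against
        # the boot-attempt budget (which is what pushed the router to
        # ERROR and caused POLL events to be dropped).
        self.instance_info = None
        self.attempt_counter = max(self.attempt_counter - 1, 0)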

Changed in akanda:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote: Fix merged to akanda-rug (stable/kilo)

Reviewed: https://review.openstack.org/205272
Committed: https://git.openstack.org/cgit/stackforge/akanda-rug/commit/?id=c4949c6e3c903ebd092b8ba992968ea02f2e6853
Submitter: Jenkins
Branch: stable/kilo

commit c4949c6e3c903ebd092b8ba992968ea02f2e6853
Author: Adam Gandelman <email address hidden>
Date: Tue Jun 23 14:57:15 2015 -0700

    Fix ability to recover from an existing appliance VM

    This fixes two issues discovered when trying to reboot a devstack
    host, which is similar to a full datacenter reboot.

    First, the RUG currently tries to delete any existing non-alive VMs
    it finds for a router. It is failing to do that at the moment
    because it attempts to use an ID from the wrong object.

    With that fixed, the VM manager and state machine are still
    incapable of removing the old appliance VM and starting with a
    fresh one. Currently, the pass through the state machine that
    detects and deletes the existing VM counts as a failed boot
    attempt, and the execution does not clear its cached instance_info
    after it has deleted the instance. This causes another pass through
    the state machine with a stale instance_info that again counts
    against the boot count. As a result, the router gets set to ERROR
    and future POLL events get dropped, so no replacement router is
    ever booted.

    Change-Id: I7d1a0a58886088cf279a68b5aaf4cff2a678e16a
    Closes-bug: #1468045

tags: added: in-stable-kilo
Sean Roberts (sarob)
Changed in akanda:
assignee: nobody → Adam Gandelman (gandelman-a)
importance: Undecided → Medium
milestone: none → liberty-2
Sean Roberts (sarob)
Changed in akanda:
status: Fix Committed → Fix Released
Changed in astara:
status: New → Fix Released
importance: Undecided → Medium
assignee: nobody → Adam Gandelman (gandelman-a)
milestone: none → 7.0.0
Changed in akanda:
milestone: liberty-2 → 7.0.0