l3-ha: a router can be stuck in the ALLOCATING state

Bug #1609738 reported by John Schwarz
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
John Schwarz

Bug Description

The scenario is a simple one: during the creation of a router, the server that deals with the request crashes after creating the router with the ALLOCATING state [1] but before it's changed to ACTIVE [2]. In this case, the router will be "stuck" in the ALLOCATING and the only admin action to change the router back to ACTIVE (and allow it to be scheduled to agents) is:

1. set admin-state-up to False
2. set ha to False
3. set ha to True
4. set admin-state-up to True

That is, a full migration of the HA router to legacy and back to HA is required. This will trigger the code in [3] and will fix this issue. However, these 4 steps aren't intuitive at all - why should a user re-set the router as an HA to solve a weird state of the router?

Skipping steps 2 and 3 (only re-setting the admin-state-up) won't work because, as mentioned before, the scheduling happens on steps 2 and 3 (i.e. when the router is set to ha=False it's unscheduled, and when it's set to ha=True it is scheduled as if it's a new router). In fact, this means that the problem is more severe: if the server crashed in the middle of setting up the resources of an HA router, all 4 steps must be done to ensure the router is made valid again.

The proposed solution is to add a new state, such that if admin-state-up is changed to False then the router's status will be changed to "DOWN" (as opposed to the current "ACTIVE", which doesn't make much sense since admin-state-up is False). This will help mitigate the "stuck ALLOCATING status" portion of the problem.

In addition to changing the status, we will need to change the logic such that a router is unscheduled on admin-state-up=False and scheduled on admin-state-up=True. This will let us skip steps 2 and 3 and go straight for re-setting the admin-state-up attribute of a router, which is more intuitive.

[1]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L469
[2]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L485
[3]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L570

John Schwarz (jschwarz)
Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
John Schwarz (jschwarz)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/352081

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/352082

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Can someone tell me what ALLOCATING mean? The only documentation I could find was a commit message, which is pretty inadequate for something at affects the API.

It seems to me that the ALLOCATING status was introduced to prevent the server from selecting a router for scheduling whilst the underlying resources were being built, but that's effectively what the admin_status_up is supposed to be for.

If a server crashes whilst a router is in ALLOCATING state, to make a recover it, a user should just delete and start again. The router is in a bad state anyway. Otherwise, a server at startup should scan for routers to be 'healed'. But again, I argue whether the ALLOCATING status is required at all.

Revision history for this message
Assaf Muller (amuller) wrote :

> It seems to me that the ALLOCATING status was introduced to prevent the server from selecting a router for scheduling whilst the underlying resources were being built, ...

Correct.

> ... but that's effectively what the admin_status_up is supposed to be for.

I disagree. admin_state is a way for the admin to give Neutron user input, the ALLOCATING state has nothing to do with that, it's read only.

Revision history for this message
John Schwarz (jschwarz) wrote :

The motivation for the ALLOCATING state is that we were reaching a lot of race conditions in the L3 allocation process that the simplest way to fix most of them was to find a way to track the fact that a router's resources are currently being allocated and that the router shouldn't be passed on to the agents. Since admin_state_up is something that is modifiable by the user ("Hey! Why did my router suddenly turn itself off?" and "Hey! It's back on again! Ghosts in the Neutron!!"), status seemed more suitable as it can contain actual meaning ("Ah, the router is currently allocating resources... OK...")

After all, we wouldn't want the user to change the admin_state_up of a router even though the server is currently allocating resources for it - that would defeat the entire purpose.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I think I was not very clear in parts of comments #3: I did not mean to imply to use the admin status in lieu of the operational status, but what I simply meant to say is that preventing an agent from scheduling an agent is something that the admin_status is for. Now, I totally agree that improving the router state machine can help address the races observed in L3 HA. Forcing the status of the router because it's admin disabled. I'd argue that a router in ALLOCATING status should be DOWN to start with and that's irrespective of whether it's disabled or not.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Forcing the status of the router to DOWN because it's admin disabled...is wrong IMO :)

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

When I go back to the bug report description, I ask myself this simple question: do we want the user to be involved in the recovery action at all? How long does 'stuck' mean? One minute? 10 minutes? Should the router eventually go into ERROR? I personally think it's dangerous to assume that the user is required/capable to participate in the recovery process of a router setup, this to tells me that our platform internals have not been designed the way they should have to make the system self-healing and make the router reach a steady state in face of failures.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

We have a discussion with John about this,the only way to have router "healed itself" is to introduce some kind of health check that will verify how long router is in 'ALLOCATING' state. But this approach introduce questions like how long is it OK for router to be 'ALLOCATING'? I'm not sure that we can come up with time suitable for all situations.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

And a user can? My concern is that if the system can't, then perhaps neither can the user, reliably anyway unless he/she waited really long. Is it possible that we're missing something in the router state machine that would allow the system to be a little more clever?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/357965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/357966

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/357966

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/357965

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/352082

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/352081

Revision history for this message
John Schwarz (jschwarz) wrote :

The direction we are (currently) going with is reverting the ALLOCATING patch altogether. This has been postponed to later on, hopefully in O.

Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/386077

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/357966
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/317949
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2ad9c679ed8718633732da1e97307f9fd9647dcc
Submitter: Jenkins
Branch: master

commit 2ad9c679ed8718633732da1e97307f9fd9647dcc
Author: John Schwarz <email address hidden>
Date: Thu Oct 13 13:54:07 2016 +0300

    Don't create HA resources until needed

    Change I3447ea5bcb7c57365c6f50efe12a1671e86588b3 introduced a new
    running-index for RouterL3AgentBinding, binding_index, which helps to
    keep count of how many bindings a router has for each agent (and how
    many bindings in total). Since we were able use this DB column to make
    sure concurrency doesn't break on creating a new HA router, we also
    postponed the creation of L3HARouterAgentPortBinding to after the first
    binding was successfully created.

    This patch proposes a change to the way routers are scheduled to an
    agent: when creating a new HA router, no L3HARouterAgentPortBinding
    entities will be created until after the corresponding
    RouterL3AgentBinding was successfully created.
    In other words, instead of pre-creating the L3HARouterAgentPortBinding
    objects without assigning it to an agent, we'll create them only after
    the RouterL3AgentBinding were successfully created.

    Related-Bug: #1609738
    Change-Id: Ie98d5e3760cdb17450aea546f4b61f5ba14baf1c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/357966
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e4c0ae223de776732f70626b387fba4de2e9c3f
Submitter: Jenkins
Branch: master

commit 3e4c0ae223de776732f70626b387fba4de2e9c3f
Author: John Schwarz <email address hidden>
Date: Fri Aug 19 15:23:36 2016 +0100

    Revert "Add ALLOCATING state to routers"

    This reverts commit 9c3c19f07ce52e139d431aec54341c38a183f0b7.

    Following the merge of Ie98d5e3760cdb17450aea546f4b61f5ba14baf1c, the
    creation of new router uses RouterL3AgentBinding and its' new
    binding_index attribute to ensure correctness of the resources. As such,
    the ALLOCATING state (which was used to do just that) is no longer
    needed and can be removed.

    Closes-Bug: #1609738
    Change-Id: Ib04e08df13ef4e6b94bd588854a5795163e2a617

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b3

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/386077
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e4b0b9f8be165e688ee56a5063a3ccc91e823636
Submitter: Jenkins
Branch: master

commit e4b0b9f8be165e688ee56a5063a3ccc91e823636
Author: John Schwarz <email address hidden>
Date: Thu Oct 13 16:10:07 2016 +0300

    Refactor L3 scheduler (unify code paths)

    This patch proposes a (rather major) refactor to the L3 scheduler.
    Basically, the auto_schedule_routers() code-path was split to 2
    different code-paths, each dealing with a different case (unscheduled
    routers vs underscheduled routers), in addition to the API-initiated
    schedule() logic. This patch removes the 2 code-paths in favor of moving
    most of the logic into schedule(). While the result is a slightly
    longer schedule(), the benefit is that a lot of the previous
    unmaintainable code-paths of auto_schedule_routers() are now removed.

    Yay! :D

    Related-Bug: #1609738
    Change-Id: I227ca60422545e40d3bbb8baf2b41a8ce14f4294

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/444387

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/444388

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/444387
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b3f4e128417dbeaa18081a39497b1c1705483cc6
Submitter: Jenkins
Branch: stable/newton

commit b3f4e128417dbeaa18081a39497b1c1705483cc6
Author: John Schwarz <email address hidden>
Date: Thu Oct 13 13:54:07 2016 +0300

    Don't create HA resources until needed

    Change I3447ea5bcb7c57365c6f50efe12a1671e86588b3 introduced a new
    running-index for RouterL3AgentBinding, binding_index, which helps to
    keep count of how many bindings a router has for each agent (and how
    many bindings in total). Since we were able use this DB column to make
    sure concurrency doesn't break on creating a new HA router, we also
    postponed the creation of L3HARouterAgentPortBinding to after the first
    binding was successfully created.

    This patch proposes a change to the way routers are scheduled to an
    agent: when creating a new HA router, no L3HARouterAgentPortBinding
    entities will be created until after the corresponding
    RouterL3AgentBinding was successfully created.
    In other words, instead of pre-creating the L3HARouterAgentPortBinding
    objects without assigning it to an agent, we'll create them only after
    the RouterL3AgentBinding were successfully created.

    Conflicts:
     neutron/scheduler/l3_agent_scheduler.py
     neutron/tests/unit/db/test_l3_hamode_db.py

    Related-Bug: #1609738
    Change-Id: Ie98d5e3760cdb17450aea546f4b61f5ba14baf1c
    (cherry picked from commit 2ad9c679ed8718633732da1e97307f9fd9647dcc)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/newton)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/444388
Reason: Abandoned as per comments

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.